Closed Bug 1305485 Opened 8 years ago Closed 8 years ago

Windows TaskCluster workers should use `hg robustcheckout`, not put data on c:\

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(firefox52 fixed)

RESOLVED FIXED
Tracking Status
firefox52 --- fixed

People

(Reporter: gps, Assigned: grenade)

References

Details

Attachments

(1 file)

There are a few major issues with hg performance on TaskCluster Windows workers currently:

1) Data on C:\ makes I/O slow per bug 1305174
2) Not using a shared / pooled storage repo
3) May not be purging/updating working copy robustly

#1 can be fixed by moving storage to z:\ and performing a clone at task time, not at AMI bake time.

#2 and #3 can be fixed by using `hg robustcheckout`. See https://hg.mozilla.org/mozilla-central/file/default/testing/docker/recipes/run-task for how `hg robustcheckout` should be used.

Essentially, automation should invoke the following for ideal Mercurial interaction:

  hg robustcheckout \
    --sharebase z:\hg-shared \ # Tells Mercurial to pool storage
    --purge \ # To delete untracked files from the source checkout (optional)
    --upstream https://hg.mozilla.org/mozilla-unified \ # Likely value
    --revision <sha1> \
    https://hg.mozilla.org/path/to/repo \
    z:\path\to\checkout

The fingerprint for hg.mozilla.org should also be obtained from a TC secret and fed into the commandline (for now - I may move fingerprint lookup into `hg robustcheckout`).

The following should also exist in the global mercurial.ini:

# By default the progress bar starts after 3s and updates every 0.1s. We
# change this so it shows and updates every 1.0s.
# We also tell progress to assume a TTY is present so updates are printed
# even if there is no known TTY.
[progress]
delay = 1.0
refresh = 1.0
assume-tty = true

[web]
cacerts = /path/to/cacerts-bundle-file

[extensions]
robustcheckout = /path/to/robustcheckout.py

You can copy robustcheckout from https://hg.mozilla.org/hgcustom/version-control-tools/file/tip/hgext/robustcheckout/__init__.py into the base AMI or Puppet so it is available to the base image.
Comment on attachment 8794966 [details]
Bug 1305485 - use robustcheckout with ephemeral storage (tc windows);

https://reviewboard.mozilla.org/r/81162/#review79772

I abbreviated the logic a bit in my Bugzilla comment. Please look at the logic inside testing/docker/recipes/run-task:vcs_checkout() and duplicate that. I'm fine deferring the hg.mo fingerprint setting to a follow-up.
Attachment #8794966 - Flags: review?(gps) → review-
(In reply to Rob Thijssen (:grenade - GMT) from comment #4)
> robust checkout shared folder and permissions on y:
> https://github.com/mozilla-releng/OpenCloudConfig/commit/
> a41b2c9df1641b2776a283324acc78675a0ed251

You should probably vendor robustcheckout.py into that repo. Or, you should pin the revision used. e.g. https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/af7287700528/hgext/robustcheckout/__init__.py.

The benefit to vendoring is if hg.mozilla.org goes down, you'll still be able to generate AMIs.

The benefit to pinning the revision is if we screw up the file in version-control-tools, that mistake won't get propagated to Windows workers. Also, if hg.mozilla.org is compromised or you aren't doing certificate checking properly in that AMI setup script, a malicious file executed on every build could get executed.

Moral of the story: vendoring reduces your exposure to down services and mitigates attack surface area.
Comment on attachment 8794966 [details]
Bug 1305485 - use robustcheckout with ephemeral storage (tc windows);

https://reviewboard.mozilla.org/r/81162/#review79778

I /think/ this will work. I'm anxious to see a Try result with this. I'm not sure we've even verified `hg robustcheckout` works on Windows. Please ping me when you have a Try push so I can verify this is doing the right thing. I'm also curious about the performance!
Attachment #8794966 - Flags: review?(gps) → review+
Blocks: 1305597
Hidden on all trees, so you'll need to click the "Excluded Jobs" in the grey bar at the top of treeherder to see them.
gps: it almost works. here's the output from a try push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff69a411ba3e09723fcfea64041709fa13c148c8&exclusion_profile=false&selectedJob=28093375):


[taskcluster 2016-09-27T09:37:59.319Z] Worker Type settings:
[taskcluster 2016-09-27T09:38:04.828Z]   {
[taskcluster 2016-09-27T09:38:04.828Z]     "aws": {
[taskcluster 2016-09-27T09:38:04.828Z]       "ami-id": "ami-48d7935f",
[taskcluster 2016-09-27T09:38:04.828Z]       "availability-zone": "us-east-1d",
[taskcluster 2016-09-27T09:38:04.828Z]       "instance-id": "i-0f42b93583ff186d5",
[taskcluster 2016-09-27T09:38:04.828Z]       "instance-type": "c3.2xlarge",
[taskcluster 2016-09-27T09:38:04.828Z]       "public-ipv4": "54.227.52.26"
[taskcluster 2016-09-27T09:38:04.828Z]     },
[taskcluster 2016-09-27T09:38:04.828Z]     "generic-worker": {
[taskcluster 2016-09-27T09:38:04.828Z]       "go-arch": "amd64",
[taskcluster 2016-09-27T09:38:04.828Z]       "go-os": "windows",
[taskcluster 2016-09-27T09:38:04.828Z]       "go-version": "go1.7",
[taskcluster 2016-09-27T09:38:04.828Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v5.2.0",
[taskcluster 2016-09-27T09:38:04.828Z]       "version": "5.2.0"
[taskcluster 2016-09-27T09:38:04.828Z]     },
[taskcluster 2016-09-27T09:38:04.828Z]     "machine-setup": {
[taskcluster 2016-09-27T09:38:04.828Z]       "maintainer": "rthijssen@mozilla.com",
[taskcluster 2016-09-27T09:38:04.828Z]       "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/76edb6b28de33f1569425ae51ce912e41487b74d/userdata/Manifest/win2012.json"
[taskcluster 2016-09-27T09:38:04.828Z]     }
[taskcluster 2016-09-27T09:38:04.828Z]   }
[taskcluster 2016-09-27T09:38:04.828Z] === Task Starting ===
[taskcluster 2016-09-27T09:38:04.829Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision ff69a411ba3e09723fcfea64041709fa13c148c8 https://hg.mozilla.org/try/ .\build\src

Z:\task_1474968877>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision ff69a411ba3e09723fcfea64041709fa13c148c8 https://hg.mozilla.org/try/ .\build\src 
ensuring https://hg.mozilla.org/try/@ff69a411ba3e09723fcfea64041709fa13c148c8 is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
Traceback (most recent call last):
  File "mercurial\dispatch.pyc", line 204, in _runcatch
  File "mercurial\dispatch.pyc", line 880, in _dispatch
  File "mercurial\dispatch.pyc", line 637, in runcommand
  File "mercurial\dispatch.pyc", line 1010, in _runcommand
  File "mercurial\dispatch.pyc", line 971, in checkargs
  File "mercurial\dispatch.pyc", line 877, in <lambda>
  File "mercurial\util.pyc", line 1036, in check
  File "C:/mozilla-build/robustcheckout.py", line 149, in robustcheckout
  File "C:/mozilla-build/robustcheckout.py", line 229, in _docheckout
  File "mercurial\hg.pyc", line 439, in clone
  File "mercurial\hg.pyc", line 175, in peer
  File "mercurial\hg.pyc", line 152, in _peerorrepo
  File "mercurial\httppeer.pyc", line 297, in instance
  File "mercurial\httppeer.pyc", line 77, in _fetchcaps
  File "mercurial\httppeer.pyc", line 206, in _call
  File "mercurial\httppeer.pyc", line 153, in _callstream
  File "urllib2.pyc", line 429, in open
  File "urllib2.pyc", line 447, in _open
  File "urllib2.pyc", line 407, in _call_chain
  File "mercurial\url.pyc", line 375, in https_open
  File "mercurial\keepalive.pyc", line 264, in do_open
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
abort: error: EOF occurred in violation of protocol (_ssl.c:590)
[taskcluster 2016-09-27T09:41:39.107Z] Exit Code: 0
[taskcluster 2016-09-27T09:41:39.107Z] === Task Finished ===
[taskcluster 2016-09-27T09:41:39.107Z] Task Duration: 3m34.279s
actually, that may have just been spectacularly bad luck. a subsequent run (no change to worker/ami, but different instance) performed the robustcheckout without issue...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff69a411ba3e09723fcfea64041709fa13c148c8&exclusion_profile=false&selectedJob=28093782
Blocks: 450645
Looks like the clone is 2-3x faster. It could be faster if there were more IOPS. But this is ephemeral storage and I don't think you have any control over IOPS there. Still promising.
Note, there is support for task-defined caches as of generic worker 5.4.0. It isn't heavily tested yet, and the docs in the review phase of a PR at the moment. What this gets you is some more control over who can use your cache, rather than any task that runs on the worker...
patch content unchanged, just updated for latest m-c merge compatibility and retested on tc win builders with hg 3.9.1 updates applied.
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Keywords: checkin-needed
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/68be44803595
use robustcheckout with ephemeral storage (tc windows); r=gps
Keywords: checkin-needed
seems to be some file lock issues we need to add wait/retry to robustcheckout for.

[taskcluster 2016-10-13T07:49:03.459Z] Worker Type settings:
[taskcluster 2016-10-13T07:49:03.471Z]   {
[taskcluster 2016-10-13T07:49:03.471Z]     "aws": {
[taskcluster 2016-10-13T07:49:03.471Z]       "ami-id": "ami-431f5723",
[taskcluster 2016-10-13T07:49:03.471Z]       "availability-zone": "us-west-1b",
[taskcluster 2016-10-13T07:49:03.471Z]       "instance-id": "i-01c9e735154d130a7",
[taskcluster 2016-10-13T07:49:03.471Z]       "instance-type": "c3.2xlarge",
[taskcluster 2016-10-13T07:49:03.471Z]       "public-ipv4": "54.153.58.136"
[taskcluster 2016-10-13T07:49:03.471Z]     },
[taskcluster 2016-10-13T07:49:03.471Z]     "generic-worker": {
[taskcluster 2016-10-13T07:49:03.472Z]       "go-arch": "amd64",
[taskcluster 2016-10-13T07:49:03.472Z]       "go-os": "windows",
[taskcluster 2016-10-13T07:49:03.472Z]       "go-version": "go1.7",
[taskcluster 2016-10-13T07:49:03.472Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v5.2.0",
[taskcluster 2016-10-13T07:49:03.472Z]       "version": "5.2.0"
[taskcluster 2016-10-13T07:49:03.472Z]     },
[taskcluster 2016-10-13T07:49:03.472Z]     "machine-setup": {
[taskcluster 2016-10-13T07:49:03.472Z]       "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/593a69f40311e4b9fc3b3e9213a9126fa307c11f/userdata/Manifest/win2012.json"
[taskcluster 2016-10-13T07:49:03.472Z]     }
[taskcluster 2016-10-13T07:49:03.472Z]   }
[taskcluster 2016-10-13T07:49:03.472Z] === Task Starting ===
[taskcluster 2016-10-13T07:49:03.473Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7480f3d08b18ee0d347a944020509c2d57bb5cf0 https://hg.mozilla.org/integration/autoland/ .\build\src

Z:\task_1476344616>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7480f3d08b18ee0d347a944020509c2d57bb5cf0 https://hg.mozilla.org/integration/autoland/ .\build\src 
ensuring https://hg.mozilla.org/integration/autoland/@7480f3d08b18ee0d347a944020509c2d57bb5cf0 is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
Traceback (most recent call last):
  File "mercurial\dispatch.pyc", line 204, in _runcatch
  File "mercurial\dispatch.pyc", line 880, in _dispatch
  File "mercurial\dispatch.pyc", line 637, in runcommand
  File "mercurial\dispatch.pyc", line 1010, in _runcommand
  File "mercurial\dispatch.pyc", line 971, in checkargs
  File "mercurial\dispatch.pyc", line 877, in <lambda>
  File "mercurial\util.pyc", line 1036, in check
  File "C:/mozilla-build/robustcheckout.py", line 149, in robustcheckout
  File "C:/mozilla-build/robustcheckout.py", line 229, in _docheckout
  File "mercurial\hg.pyc", line 496, in clone
  File "mercurial\hg.pyc", line 385, in clonewithshare
  File "mercurial\exchange.pyc", line 1186, in pull
  File "mercurial\exchange.pyc", line 1327, in _pullbundle2
  File "mercurial\bundle2.pyc", line 355, in processbundle
  File "mercurial\bundle2.pyc", line 432, in _processpart
  File "mercurial\bundle2.pyc", line 1361, in handlechangegroup
  File "mercurial\changegroup.pyc", line 316, in apply
  File "mercurial\revlog.pyc", line 1572, in addgroup
  File "mercurial\changelog.pyc", line 138, in _delay
  File "mercurial\changelog.pyc", line 78, in __init__
  File "mercurial\store.pyc", line 480, in __call__
  File "mercurial\scmutil.pyc", line 588, in __call__
  File "mercurial\windows.pyc", line 126, in posixfile
IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\00changelog.i: Access is denied
abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\00changelog.i: Access is denied
[taskcluster 2016-10-13T07:49:05.712Z] Exit Code: 0
[taskcluster 2016-10-13T07:49:05.712Z] === Task Finished ===
[taskcluster 2016-10-13T07:49:05.712Z] Task Duration: 2.2399978s
had to back this out in coordination with Rob after we had some failures like https://treeherder.mozilla.org/logviewer.html#?job_id=4994445&repo=autoland in the following push after this change landed
Flags: needinfo?(rthijssen)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6cb3fe45e731
Backed out changeset 68be44803595 for further investigations after failures when this landed
gps, i think the actual filesystem access is good. it's set here: https://github.com/mozilla-releng/OpenCloudConfig/blob/593a69f40311e4b9fc3b3e9213a9126fa307c11f/userdata/Manifest/gecko-1-b-win2012.json#L1041-L1057

several try pushes succeeded without issue as well as the autoland merge (https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=68be448035953cf34c99cc642374ea54754c86e2). it was a subsequent autoland commit that barfed when an instance (i-01c9e735154d130a7) which had already performed a successful robustcheckout on the preceding task attempted the next robustcheckout. so i guess at that point y:\hg-shared had been created already.

i suspect the error above is just a race condition or file lock issue that could be fixed with a wait/retry loop in robustcheckout. do you have any suggestions?
Flags: needinfo?(rthijssen) → needinfo?(gps)
I don't think it is a race condition or file lock issue, as the stack indicates the permissions failure occurs when opening a file in append mode. This is after a write lock has been acquired.

I think this is your run of the mill file permissions issue. The file in question (.hg/store/00changelog.i) isn't writable by the 2nd task.

IIRC TaskCluster on Windows creates a new user for each task. So my guess is this new user doesn't have write permissions on hg files created by a previous user. Someone will have to poke at the filesystem of a worker to see what permissions in y:\hg-shared look like after a task performs an initial clone.
Flags: needinfo?(gps)
needinfo grenade to grab more filesystem details from Windows worker.

At this point, I just want details. We can discuss options once we see what's happening.
Flags: needinfo?(rthijssen)
to check permissions after initial clone, i modified a task (https://tools.taskcluster.net/task-inspector/#O306m3mcQlKQTIzLS-Ft3Q/) from one of the earlier try commits, using tc task inspector, to perform the clone then run:
`icacls y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store`

the output was:
[taskcluster 2016-10-17T12:20:36.972Z] Executing command 1: icacls y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store
Z:\task_1476704133>icacls y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store 
y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store BUILTIN\Administrators:(I)(OI)(CI)(F)
                                                                NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
                                                                I-0606177769CDC\task_1476704133:(I)(F)
                                                                CREATOR OWNER:(I)(OI)(CI)(IO)(F)
                                                                BUILTIN\Users:(I)(OI)(CI)(RX)
                                                                BUILTIN\Users:(I)(CI)(AD)
                                                                BUILTIN\Users:(I)(CI)(WD)
Successfully processed 1 files; Failed processing 0 files
[taskcluster 2016-10-17T12:20:37.193Z] Exit Code: 0
Flags: needinfo?(rthijssen)
problem was indeed a run of the mill file permissions issue. the access rights for y:\hg-shared were not correctly applied during the dsc run on windows worker types due to a syntax error in the config. i reran the test today on the corrected environment and got acls that include the "everyone (f)" acl:
y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store Everyone:(I)(OI)(CI)(F)
                                                                BUILTIN\Administrators:(I)(OI)(CI)(F)
                                                                NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
                                                                I-0EA2E9221E801\task_1476860601:(I)(F)
                                                                CREATOR OWNER:(I)(OI)(CI)(IO)(F)
                                                                BUILTIN\Users:(I)(OI)(CI)(RX)
                                                                BUILTIN\Users:(I)(CI)(AD)
                                                                BUILTIN\Users:(I)(CI)(WD)

https://tools.taskcluster.net/push-inspector/#/O306m3mcQlKQTIzLS-Ft3Q/GnvrY6q7Sw-Rcodw2Thb-w?_k=e4siyj

please can we re-land the patch (environment is now corrected)?
Keywords: checkin-needed
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d99d0f49ac4a
use robustcheckout with ephemeral storage (tc windows); r=gps
Keywords: checkin-needed
The logs don't seem to be updating when running `hg robustcheckout`. This is likely a combination of:

a) progress extension not properly configured
b) need to use PYTHONUNBUFFERED=1 to force Python/Mercurial to not buffer stdout
Blocks: 1311226
progress is working. It looks like the live.log streaming isn't working. The output shows up once the task has completed.

There is a problem with unordered stdout/stderr. For that, we'll need PYTHONUNBUFFERED=1. I may just fix that in robustcheckout.py since multiple consumers have this problem.
https://hg.mozilla.org/mozilla-central/rev/d99d0f49ac4a
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
note that there was an issue on try workers (gecko-1-b-win2012) not getting hg-shared permissions set due to dsc being switched off on worker spot instances. level 2 and 3 repos (like m-c, autoland, etc) are not affected because they use different workers (gecko-2-b-win2012 and gecko-3-b-win2012).

i've corrected the problem here: https://github.com/mozilla-releng/OpenCloudConfig/commit/33f01c828ef6ab25bcf5258eba09e8ef806768e1 and am waiting for the ami rebuild in order to fix tc win builds on try in an hour or so.
(In reply to Rob Thijssen (:grenade - GMT) from comment #31)
> PYTHONUNBUFFERED set on all gecko windows worker types
> https://github.com/mozilla-releng/OpenCloudConfig/commit/
> 333c9a4058bee679c7131cfc587e2cfa3603970f

Can you please back this out?

Setting PYTHONUNBUFFERED globally will affect all Python processes, including thousands of Python processes invoked by the build system. While we do want many processes to be unbuffered, making processes fully unbuffered will result in the process handling stdout/stderr to process a lot more write() calls. This can result in significant overhead.

We're currently hacking PYTHONUNBUFFERED into `hg robustcheckout` calls in a number of places. I'll just modify the robustcheckout extension to unbuffer.
Flags: needinfo?(rthijssen)
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: