Some gecko-1-b-win2012 workers take forever to check out mercurial files

Status: RESOLVED FIXED
Product: Infrastructure & Operations :: RelOps
People: Reporter: glandium; Assignee: grenade
Tracking: Blocks 1 bug
Attachments: 1 attachment
(Reporter)

Description

a year ago
The following tasks, the first of which is still running, have taken more than 30 minutes to check out the mercurial data (after the clone itself, whether over the network or possibly from the local shared data):
- H_uVPziJR0KEA3BF6T3Wsw
- C-wllbZJSLCYXwfHCe66-Q
- D1I8p43iROeArOJQoF728g
- VS0FG7L4SD2Mt-0aIlNJww
- GSmFf3LNTvK9ZE_yBTm5AA
- c9nbnarpShGtFSOwE9-yHw
- Im9Y7NclTVK0TQTkzcZUvg
(Reporter)

Comment 1

a year ago
Please note that even when it's "fast", it still takes several minutes. For example, O3OsQKhIQ6aKF-oLuCPKzQ still took almost 5 minutes.
Persisting key logging from one of the logs, as the task logs will expire...

Here we see a wall time of 44m46.6840004s for the hg robustcheckout command.


[taskcluster 2017-07-15T00:34:18.644Z] Worker Type (gecko-1-b-win2012) settings:
[taskcluster 2017-07-15T00:34:18.644Z]   {
[taskcluster 2017-07-15T00:34:18.644Z]     "aws": {
[taskcluster 2017-07-15T00:34:18.644Z]       "ami-id": "ami-ccc7f5da",
[taskcluster 2017-07-15T00:34:18.644Z]       "availability-zone": "us-east-1e",
[taskcluster 2017-07-15T00:34:18.644Z]       "instance-id": "i-0bc1c310290ab3cc4",
[taskcluster 2017-07-15T00:34:18.644Z]       "instance-type": "c4.4xlarge",
[taskcluster 2017-07-15T00:34:18.644Z]       "local-ipv4": "172.31.57.209",
[taskcluster 2017-07-15T00:34:18.644Z]       "public-hostname": "ec2-54-145-239-43.compute-1.amazonaws.com",
[taskcluster 2017-07-15T00:34:18.647Z]       "public-ipv4": "54.145.239.43"
[taskcluster 2017-07-15T00:34:18.647Z]     },
[taskcluster 2017-07-15T00:34:18.647Z]     "config": {
[taskcluster 2017-07-15T00:34:18.647Z]       "deploymentId": "1f3ea73960ea",
[taskcluster 2017-07-15T00:34:18.647Z]       "runTasksAsCurrentUser": false
[taskcluster 2017-07-15T00:34:18.647Z]     },
[taskcluster 2017-07-15T00:34:18.647Z]     "generic-worker": {
[taskcluster 2017-07-15T00:34:18.647Z]       "go-arch": "amd64",
[taskcluster 2017-07-15T00:34:18.647Z]       "go-os": "windows",
[taskcluster 2017-07-15T00:34:18.647Z]       "go-version": "go1.7.5",
[taskcluster 2017-07-15T00:34:18.647Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.0.5",
[taskcluster 2017-07-15T00:34:18.647Z]       "version": "10.0.5"
[taskcluster 2017-07-15T00:34:18.647Z]     },
[taskcluster 2017-07-15T00:34:18.647Z]     "machine-setup": {
[taskcluster 2017-07-15T00:34:18.647Z]       "ami-created": "2017-06-29 18:37:12.559Z",
[taskcluster 2017-07-15T00:34:18.647Z]       "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/1f3ea73960ea811c1f0e79729be21e67732bd5b1/userdata/Manifest/gecko-1-b-win2012.json"
[taskcluster 2017-07-15T00:34:18.647Z]     }
[taskcluster 2017-07-15T00:34:18.647Z]   }
[taskcluster 2017-07-15T00:34:18.647Z] Task ID: D1I8p43iROeArOJQoF728g
[taskcluster 2017-07-15T00:34:18.647Z] === Task Starting ===
[taskcluster 2017-07-15T00:34:19.156Z] Uploading file public/logs/live.log as artifact public/logs/live.log
[taskcluster 2017-07-15T00:34:19.651Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision %GECKO_HEAD_REV% %GECKO_HEAD_REPOSITORY% .\build\src

Z:\task_1500078434>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 05d3cf8977013e21a5926e15e2bd5386e16c167f https://hg.mozilla.org/try/ .\build\src 

manifests [>                                                           ]  1/48
                                                                               

files [=>                                                      ]   57/1194 42s
files [==>                                                     ]   64/1194 57s
...
...
...
<snip/>
...
...
...
files [==========================================>             ]  924/1194 06s
files [==========================================>             ]  932/1194 06s
                                                                               
ensuring https://hg.mozilla.org/try/@05d3cf8977013e21a5926e15e2bd5386e16c167f is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
searching for changes
adding changesets
adding manifests
adding file changes
added 48 changesets with 928 changes to 936 files

searching [ <=>                                                            ] 2
                                                                               

changesets [============================================================>] 1/1
                                                                               
adding remote bookmark aurora
adding remote bookmark beta
adding remote bookmark central
adding remote bookmark esr10
adding remote bookmark esr17
adding remote bookmark esr24
adding remote bookmark esr31
adding remote bookmark esr38
adding remote bookmark esr45
adding remote bookmark esr52
adding remote bookmark fx-team
adding remote bookmark inbound
adding remote bookmark release
(pulling to obtain 05d3cf8977013e21a5926e15e2bd5386e16c167f)
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 4 files (+1 heads)

updating [                                                     ]    100/196175
updating [                                               ]    200/196175 1h25m
...
...
...
<snip/>
...
...
...
updating [===============================================> ] 196000/196175 03s
updating [================================================>] 196175/196175 01s
                                                                               
196175 files updated, 0 files merged, 0 files removed, 0 files unresolved
updated to 05d3cf8977013e21a5926e15e2bd5386e16c167f
[taskcluster 2017-07-15T01:19:06.338Z]    Exit Code: 0
[taskcluster 2017-07-15T01:19:06.338Z] Success Code: 0x0
[taskcluster 2017-07-15T01:19:06.339Z]    User Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z]  Kernel Time: 0s
[taskcluster 2017-07-15T01:19:06.339Z]    Wall Time: 44m46.6840004s
[taskcluster 2017-07-15T01:19:06.339Z]  Peak Memory: 2797568
[taskcluster 2017-07-15T01:19:06.339Z]       Result: SUCCEEDED
Assignee: relops → rthijssen
(Assignee)

Comment 3

a year ago
Created attachment 8887956 [details] [diff] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/81

this change ensures that y:\hg-shared contains a prepopulated robustcheckout sharebase during ami creation and hopefully reduces checkout time on the first task checkout for each spot instance derived from that ami.
Attachment #8887956 - Flags: review?(pmoore)
Comment on attachment 8887956 [details] [diff] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/81

Review of attachment 8887956 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thanks Rob!
Attachment #8887956 - Attachment is patch: true
Attachment #8887956 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8887956 - Flags: review?(pmoore) → review+
(Reporter)

Comment 5

a year ago
Note the time was not spent on *cloning*, which hg-shared would affect, but on *checkout*, which is taking the data out of the .hg directory and laying it out on disk outside of it.
Probably something is going wrong in the formatting of the ebs volume, such that not all sectors are touched - but we are hoping to fix that in bug 1378381, which should solve this issue.

I think grenade's patch is still good, as it addresses a related issue (the clone should be faster now), but it sounds like we'll need bug 1378381 solved too.
Depends on: 1378381
I'm reasonably confident that baking the Mercurial share "cache" into the AMI is counter-productive. We've done this before with other instances and it was a net loss.

By baking it into the AMI you are creating the need to touch sectors on instance initialization to mitigate copy-on-first-access with EBS volumes initialized from an AMI. It is faster to initialize an empty EBS volume, not touch sectors on instance initialization, and populate the Mercurial clone on a fresh EBS volume.

Of course, as long as you are accessing sectors on volumes initialized from AMIs on instance initialization, it doesn't matter if the sector has data or not: it will still be slow. So until we stop using AMIs, adding the Mercurial data to the AMI shouldn't have a significant negative impact (other than increasing AMI size and corresponding overhead, which I assume is small). But if the Mercurial data is the reason for using an AMI, then this will likely be slower than no AMI.

FWIW, my measurements have shown that reading sectors on an AMI-initialized volume results in throughput of 2-4 MB/s. An uncompressed Mercurial clone (what robustcheckout will do in EC2) should be >10 MB/s for Windows EC2 instances.

If we want perf wins, I'd focus on not initializing EBS volumes from AMIs.
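A rough way to reproduce that throughput comparison is a sequential-read probe run on the instance. This is only a sketch: PROBE_DIR and the 64 MiB size are assumptions, and a file written immediately before being read back will mostly be served from the page cache, so for realistic copy-on-first-access numbers you would read data that already existed on the AMI-initialized volume.

```shell
# Sequential-read throughput probe (sketch; PROBE_DIR is an assumption --
# point it at a mount of the volume under test).
PROBE_DIR="${PROBE_DIR:-/tmp}"
PROBE_FILE="$PROBE_DIR/ebs_probe.bin"
SIZE_MB=64

# Write a test file, then time a sequential read back through dd.
dd if=/dev/zero of="$PROBE_FILE" bs=1M count="$SIZE_MB" conv=fsync 2>/dev/null
start=$(date +%s%N)
dd if="$PROBE_FILE" of=/dev/null bs=1M 2>/dev/null
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))
[ "$elapsed_ms" -gt 0 ] || elapsed_ms=1   # guard against sub-millisecond reads
echo "read ${SIZE_MB} MiB in ${elapsed_ms} ms (~$(( SIZE_MB * 1000 / elapsed_ms )) MB/s)"
rm -f "$PROBE_FILE"
```

Comparing the printed rate on an AMI-initialized volume against a fresh empty volume should make the 2-4 MB/s vs >10 MB/s difference described above directly visible.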
(Reporter)

Comment 8

a year ago
While not directly relevant to windows builds, I was looking at the memory usage during a linux build the other day, and saw that there was something like a peak of 16GB of caches, and never more than 9GB of memory used by processes. On linux, this means we could actually get around any I/O issues we may have by putting /home/worker/workspace on a tmpfs. I wonder if we'd get similar benefits from using a large RAM disk for the build directory on Windows.
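On Linux, that experiment could be as simple as mounting a tmpfs over the workspace. A sketch (the mount point comes from the comment above; the 16g size is an assumption mirroring the observed cache peak):

```
# /etc/fstab entry (sketch), or equivalently:
#   mount -t tmpfs -o size=16g tmpfs /home/worker/workspace
tmpfs  /home/worker/workspace  tmpfs  size=16g,mode=0755  0  0
```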
(Reporter)

Comment 9

a year ago
BTW, this bug is still very much a problem. I regularly get Windows builds that spend way more than an hour on `hg update`.
(Assignee)

Comment 10

a year ago
jhford: i'm working on updates to occ that would attach fresh ebs volumes to spot instances at boot. to support this, i have tried to change the launchSpec in AWS provisioner config from:
      "launchSpec": {
        "IamInstanceProfile": {
          "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
        }
      }
to:
      "launchSpec": {
        "IamInstanceProfile": {
          "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
        },
        "BlockDeviceMapping": [
          {
            "DeviceName":"/dev/sda1",
            "Ebs":{
              "VolumeSize": 120,
              "DeleteOnTermination": true
            }
          },
          {
            "DeviceName":"/dev/sdb",
            "Ebs":{
              "VolumeSize": 120,
              "DeleteOnTermination": true
            }
          }
        ]
      }

the provisioner ui shows this error when i attempt to update:

Unknown Error! 

Invalid workerType: ["Error: Launch specifications are invalid
    at Function.WorkerType.testLaunchSpecs (/app/src/worker-type.js:637:15)
    at AwsManager._callee6$ (/app/src/aws-manager.js:323:34)
    at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)
    at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)
    at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)
    at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)
    at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14
    at F (/app/node_modules/core-js/library/modules/_export.js:35:28)
    at AwsManager.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)
    at AwsManager.workerTypeCanLaunch (/app/lib/aws-manager.js:671:22)
    at Object._callee4$ (/app/src/api-v1.js:293:42)
    at tryCatch (/app/node_modules/regenerator-runtime/runtime.js:64:40)
    at Generator.invoke [as _invoke] (/app/node_modules/regenerator-runtime/runtime.js:299:22)
    at Generator.prototype.(anonymous function) [as next] (/app/node_modules/regenerator-runtime/runtime.js:116:21)
    at step (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:17:30)
    at /app/node_modules/babel-runtime/helpers/asyncToGenerator.js:35:14
    at F (/app/node_modules/core-js/library/modules/_export.js:35:28)
    at Object.<anonymous> (/app/node_modules/babel-runtime/helpers/asyncToGenerator.js:14:12)
    at Object.<anonymous> (/app/src/api-v1.js:257:1)
    at /app/node_modules/taskcluster-lib-api/src/api.js:538:22
    at tryCallOne (/app/node_modules/promise/lib/core.js:37:12)
    at /app/node_modules/promise/lib/core.js:123:15
    at flush (/app/node_modules/asap/raw.js:50:29)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickDomainCallback (internal/process/next_tick.js:128:9)"]

i'm basing my launchSpec config on what i understood from http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_LaunchSpecification.html

do you think i have the syntax wrong or is this just unsupported in the current aws provisioner code?
Flags: needinfo?(jhford)
(Assignee)

Comment 11

a year ago
ignore comment 10. it should have been BlockDeviceMappings rather than BlockDeviceMapping
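For reference, the accepted form of the fragment from comment 10 (identical values, only the key renamed; a sketch, not re-verified against the provisioner schema):

```json
{
  "launchSpec": {
    "IamInstanceProfile": {
      "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
    },
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/sda1",
        "Ebs": { "VolumeSize": 120, "DeleteOnTermination": true }
      },
      {
        "DeviceName": "/dev/sdb",
        "Ebs": { "VolumeSize": 120, "DeleteOnTermination": true }
      }
    ]
  }
}
```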
Flags: needinfo?(jhford)
3 failures in 888 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-release: 1
* mozilla-central: 1
* mozilla-beta: 1

Platform breakdown:
* windows2012-32: 2
* windows2012-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-07-31&endday=2017-08-06&tree=all
FWIW, this bug is particularly bad because it breaks Windows builds that then can't be retriggered from Treeherder due to chain of trust issues. So literally every failure reported under this bug requires manual intervention with appropriate scopes to retrigger a lost job. Is there anything we can do to bump the priority here?
This is the top thing Rob has been working on. He's on PTO this week and will return to it when he's back on Monday.

Comment 15

11 months ago
35 failures in 901 pushes (0.039 failures/push) were associated with this bug in the last 7 days. 

This is the #43 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-beta: 25
* mozilla-central: 4
* mozilla-release: 2
* mozilla-inbound: 2
* autoland: 2

Platform breakdown:
* windows2012-64: 14
* windows2012-32: 10
* windows2012-64-devedition: 6
* windows2012-32-devedition: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-08-07&endday=2017-08-13&tree=all

Updated

11 months ago
Blocks: 1271162
Hey Rob, welcome back from PTO! I was wondering if there was any progress to report in this bug?
Flags: needinfo?(rthijssen)
ryanvm: see the blocking bug, that's where the OCC work is being tracked.
Flags: needinfo?(rthijssen)

Comment 18

11 months ago
10 failures in 949 pushes (0.011 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 10

Platform breakdown:
* windows2012-64-devedition: 4
* windows2012-32-devedition: 3
* windows2012-32: 2
* windows2012-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-08-14&endday=2017-08-20&tree=all
(Assignee)

Comment 19

11 months ago
when using fresh ebs volumes, hg robustcheckout fails with the following error:

abort: The filename or extension is too long: 'y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\data/mobile/android/tests/browser/chrome/tp5/amazon.com/www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B002Y27P3M/%5C%22http%3A/g-ecx.images-amazon.com/images/G/01/kindle/shasta/photos'

not sure it's related, but my patch to use fresh ebs volumes on try was reverted due to the error above (https://treeherder.mozilla.org/#/jobs?repo=try&revision=074acb44df33719de2e47ee941887fc8da6e61e4&selectedJob=123961923).

Comment 20

11 months ago
The robustcheckout failure will be fixed by bug 1391424.

But, the conditions leading to that error should not occur! If I had to venture a guess, it would be that the population of y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 during instance "bootstrap" is failing somehow.

I would remove anything to do with y:\hg-shared from OCC except for creating an empty directory and setting its permissions. That will force y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29 to be created at first task time. And it should avoid the path too long issues.

Comment 21

11 months ago
Bug 1391424 just landed. So if you take the latest version of robustcheckout from the version-control-tools repo (https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/134574b64ddf/hgext/robustcheckout/__init__.py), this error will be worked around. I'd still like to not populate y:\hg-shared during instance "bootstrap." But that could be done as a follow-up.
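Deploying that updated extension amounts to pointing Mercurial's config at the new file. A sketch (the local path is hypothetical, standing in for wherever the downloaded __init__.py is saved on the worker):

```
; mercurial.ini / hgrc sketch -- path is an assumption
[extensions]
robustcheckout = c:\mozilla-build\robustcheckout\__init__.py
```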

Comment 22

11 months ago
15 failures in 143 pushes (0.105 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* mozilla-beta: 15

Platform breakdown:
* windows2012-64: 5
* windows2012-32: 5
* windows2012-64-devedition: 3
* windows2012-32-devedition: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-08-22&endday=2017-08-22&tree=all
Duplicate of this bug: 1393195

Comment 24

11 months ago
24 failures in 173 pushes (0.139 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* try: 20
* mozilla-beta: 3
* mozilla-central: 1

Platform breakdown:
* windows2012-32: 9
* windows2012-64: 7
* windows2012-64-devedition: 4
* windows2012-32-devedition: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-08-23&endday=2017-08-23&tree=all

Comment 25

11 months ago
67 failures in 908 pushes (0.074 failures/push) were associated with this bug in the last 7 days. 

This is the #29 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-beta: 41
* try: 24
* mozilla-central: 1
* autoland: 1

Platform breakdown:
* windows2012-32: 26
* windows2012-64: 20
* windows2012-32-devedition: 11
* windows2012-64-devedition: 10

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-08-21&endday=2017-08-27&tree=all

Comment 26

11 months ago
11 failures in 939 pushes (0.012 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 10
* mozilla-esr52: 1

Platform breakdown:
* windows2012-64-devedition: 3
* windows2012-64: 3
* windows2012-32-devedition: 3
* windows2012-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1381768&startday=2017-08-28&endday=2017-09-03&tree=all
(Assignee)

Comment 27

11 months ago
ebs patch has landed, as have the robustcheckout and mercurial upgrades.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED