Frequent build timeouts when (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
Categories
(Firefox Build System :: Task Configuration, defect)
Tracking
(Not tracked)
People
(Reporter: CosminS, Assigned: dhouse)
References
Details
(Whiteboard: [stockwell infra])
Attachments
(4 files)
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=288443741&repo=autoland
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288433462&repo=autoland&lineNumber=102
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288437926&repo=autoland&lineNumber=36
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288432758&repo=autoland&lineNumber=41
Timeouts on central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=busted%2Cexception&revision=99066bbd1b3a288471f2bf7f59f1c99da42610a1&selectedJob=288438390
There were no alerts whatsoever in #vcs in this timeframe that would point to server-side issues.
Tom, could these be related to the taskcluster config that switches b-linux workers from AWS to GCP?
Reporter
Comment 1•5 years ago
This is blocking reopening autoland because the failure rate is still high and there can't be test coverage this way.
Reporter
Comment 2•5 years ago
We've rerun most of the busted builds and the majority of them were green on first try. There are still failures but they don't seem so frequent so we've reopened autoland.
Reporter
Comment 3•5 years ago
Re-closed autoland because it looks grim: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&searchStr=build&fromchange=aa8c928303a75f9845c84c0fb2fcb45d78e60b50&tochange=174d5be06caa95bdc01153cdd8cff5152f406b6f&selectedJob=288462800
Comment hidden (Intermittent Failures Robot)
Comment 5•5 years ago
This is also blocking beta builds due to frequent failures. Can we get this fixed or the GCP switch reverted today?
Comment 6•5 years ago
Looking at some of the logs, it seems there are two problems.
- The download speed for the bundles is much lower than usual. In https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288433462&repo=autoland&lineNumber=102 it's only 30 MiB/s. Testing on my own instance in GCE I was able to hit 46.7 MB/sec and download the repo bundle in 70s. Usual numbers are around 55s for the download.
- The output logs seem to go blank after a few seconds, showing no output from hg, and then fail with a timeout.
I wonder if there is some sort of disk I/O issue that could be causing both the slow timeouts and the lack of logging output. Can someone take a look at the disk I/O graphs (probably in StackDriver?) and see if anything looks awry for the new build workers compared to the old ones? If it's possible to give me access to those dashboards to debug that would work too.
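For anyone re-checking the throughput numbers above, a minimal way to time the bundle download from a GCE instance is a curl run like the following (a sketch; the bundle URL is the one that appears in the failing task logs and will eventually expire, so substitute a current one from the clone bundles manifest):
# Time the stream-bundle download without keeping the ~3 GB file; curl prints the average transfer speed at the end.
$ curl -L -o /dev/null -w 'downloaded %{size_download} bytes in %{time_total}s (%{speed_download} bytes/s)\n' https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/b8ad0267e88abff0a54a6f0f1b5de2ad3a018460.stream-v2.hg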
Comment 7•5 years ago
I wonder if there is some sort of disk I/O issue that could be causing both the slow timeouts and the lack of logging output. Can someone take a look at the disk I/O graphs (probably in StackDriver?) and see if anything looks awry for the new build workers compared to the old ones? If it's possible to give me access to those dashboards to debug that would work too.
hrm, I don't know who has access to stack driver in production, 303 :fubar
Comment 8•5 years ago
I think I've given Connor user access to metrics in the fxci-production-level{1,3}-workers projects. Looking at the graphs, it does look pretty spiky and there are some definite outliers, but generally looks ok?
Dave's going to help y'all debug and get more access as needed.
Assignee
Comment 9•5 years ago
We discussed the problem in #vcs, on chat.m.o, and determined to roll back because the cause is not clear and appears to be limited to the worker or project configuration.
Wander requested ssh access, and we both found the ssh connections to the instances (tested with level1) fail.
With a confirmed key added (new, verified through the serial port 1/console), I still get a failed ssh connection. I'm not sure if there is a VPN or firewall rules set up (looking).
Assignee
Comment 10•5 years ago
(In reply to Dave House [:dhouse] from comment #9)
Wander requested ssh access, and we both found the ssh connections to the instances (tested with level1) fail.
With a confirmed key added (new, verified through the serial port 1/console), I still get a failed ssh connection. I'm not sure if there is a VPN or firewall rules set up (looking).
The instances are configured with external IPs, and the firewall rules allow icmp, livelog, from external.
I've added an exception for ssh for myself.
Wander, to ssh in please add or edit a "ssh-external" firewall rule to allow your IP to ssh in, like:
Name: ssh-external
Direction: Ingress
Targets: Apply to all
IP ranges: yourip/32
Protocols/ports: tcp:22
Action: Allow
Priority: 1000
Network: default
and then ssh with gcloud or ssh directly (the web ssh from gcp's console requires adding google's ip ranges to the rule or opening it up wide).
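For reference, a sketch of the equivalent rule via the gcloud CLI (the rule name, priority, and network are the values listed above; YOUR_IP is a placeholder for your public address; use "update" instead of "create" if the rule already exists):
$ gcloud compute firewall-rules create ssh-external \
    --project=fxci-production-level1-workers \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=YOUR_IP/32 \
    --priority=1000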
Comment 11•5 years ago
Hi, are there any updates on the rollback? Is there an ETA?
Updated•5 years ago
Comment 12•5 years ago
This patch essentially backs out 5b47f5e53d2a, but keeps the gecko-{1,3}/b-linux-aws pools until we merge everywhere.
- renames gecko-{1,3}/b-linux to gecko-{1,3}/b-linux-gcp, and reduces their max capacity.
- copies the gecko-{1,3}/b-linux-aws configs to gecko-{1,3}/b-linux and ups their max capacity.
Updated•5 years ago
Comment 13•5 years ago
I've applied the patch via ci-admin, so we should be switching over to aws builds shortly.
Comment 14•5 years ago
I decided that leaving this unlanded was probably playing with fire, so I landed it r=bustage https://hg.mozilla.org/ci/ci-configuration/rev/153c4ba1772cec9c2dc3c8e13e06f85d0cc57da1 . Still open for review or followup as needed.
Comment 15•5 years ago
Getting a lot of errors like
Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z'. The referenced image resource cannot be found.
Not sure what to do about this, other than leave it for Wander's/Miles' morning?
Comment 16•5 years ago
Fyi changes from bug 1547111 got backed out here:
https://hg.mozilla.org/integration/autoland/rev/7dcbf3debe6a3a527df82c422446ca42a3de78c4
https://hg.mozilla.org/mozilla-central/rev/5fa1a31cb9547a0400f0955707e46b3046680123
We're no longer hitting this issue after the rollback, except on existing gcp gecko-{1,3}/b-linux workers that are timing out running tasks.
Assignee
Comment 17•5 years ago
(In reply to Dave House [:dhouse] from comment #10)
... (the web ssh from gcp's console requires adding google's ip ranges to the rule or opening it up wide).
Wander, I added a rule to allow the gcp console browser ssh connections in l1. So you can use that if you need to attach and test/debug. I'll remove it when you're done or in a few days.
Comment 18•5 years ago
(In reply to Aki Sasaki [:aki] (he/him) (UTC-7) from comment #15)
Getting a lot of errors like
Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z'. The referenced image resource cannot be found.
Not sure what to do about this, other than leave it for Wander's/Miles' morning?
https://hg.mozilla.org/ci/ci-configuration/rev/c4be240dd8f6c83fe3f548ab946096df223fc56a should fix this issue.
dhouse is killing old gcp gecko-{1,3}/b-linux workers.
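A sketch of how those leftover workers could be cleaned up with gcloud (level 1 project shown; the name filter is an assumption based on the pool and instance names in this bug -- review the list before deleting anything):
# List the old b-linux workers and delete them one by one.
PROJECT=fxci-production-level1-workers
gcloud compute instances list --project="$PROJECT" \
    --filter="name~'^gecko-1-b-linux'" --format="value(name,zone.basename())" |
while read -r name zone; do
  gcloud compute instances delete "$name" --project="$PROJECT" --zone="$zone" --quiet
done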
Comment 19•5 years ago
No timeouts since workers got killed.
Comment hidden (Intermittent Failures Robot)
Comment 21•5 years ago
(In reply to Dave House [:dhouse] from comment #10)
(In reply to Dave House [:dhouse] from comment #9)
Wander requested ssh access, and we both found the ssh connections to the instances (tested with level1) fail.
With a confirmed key added (new, verified through the serial port 1/console), I still get a failed ssh connection. I'm not sure if there is a VPN or firewall rules set up (looking).
The instances are configured with external IPs, and the firewall rules allow icmp, livelog, from external.
I've added an exception for ssh for myself.
Wander, to ssh in please add or edit a "ssh-external" firewall rule to allow your IP to ssh in, like:
Name: ssh-external
Direction: Ingress
Targets: Apply to all
IP ranges: yourip/32
Protocols/ports: tcp:22
Action: Allow
Priority: 1000
Network: default
and then ssh with gcloud or ssh directly (the web ssh from gcp's console requires adding google's ip ranges to the rule or opening it up wide).
I still have no permission:
$ gcloud beta compute --project "fxci-production-level1-workers" ssh --zone "us-central1-c" "gecko-1-b-linux-gcp-a-zmqnybq2gwr7tzzputtg"
WARNING: The public SSH key file for gcloud does not exist.
WARNING: The private SSH key file for gcloud does not exist.
WARNING: You do not have an SSH key for gcloud.
WARNING: SSH keygen will be executed to generate a key.
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/vagrant/.ssh/google_compute_engine.
Your public key has been saved in /home/vagrant/.ssh/google_compute_engine.pub.
The key fingerprint is:
SHA256:47xxSavHa81GdWP6fcsA2X9hVc5Ohk3iBl7B6WY627c vagrant@vagrant
The key's randomart image is:
+---[RSA 3072]----+
| ..+oo|
| . +oB.|
| ..+ B|
| o.=*o|
| S .o *o+o|
| o o o=.o .|
| +.++ =..o|
| =+ = +.=|
| oo.o E+|
+----[SHA256]-----+
Updating project ssh metadata...failed.
Updating instance ssh metadata...failed.
ERROR: (gcloud.beta.compute.ssh) Could not add SSH key to instance metadata:
- The user does not have access to service account 'taskcluster-worker@fxci-production-level1-workers.iam.gserviceaccount.com'. User: 'wcosta@mozilla.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account
Assignee
Comment 22•5 years ago
(In reply to Wander Lairson Costa from comment #21)
- The user does not have access to service account 'taskcluster-worker@fxci-production-level1-workers.iam.gserviceaccount.com'. User: 'wcosta@mozilla.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account
Please try again now; I re-added your access to level1. Your access was removed -- I'll need to check if there is automation that removed it or if someone manually did so.
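For the record, the role named in that error can be granted with a single gcloud command (a sketch; the service account and user are the ones from the error message in comment 21):
$ gcloud iam service-accounts add-iam-policy-binding \
    taskcluster-worker@fxci-production-level1-workers.iam.gserviceaccount.com \
    --project=fxci-production-level1-workers \
    --member=user:wcosta@mozilla.com \
    --role=roles/iam.serviceAccountUser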
Comment 23•5 years ago
(In reply to Dave House [:dhouse] from comment #22)
(In reply to Wander Lairson Costa from comment #21)
- The user does not have access to service account 'taskcluster-worker@fxci-production-level1-workers.iam.gserviceaccount.com'. User: 'wcosta@mozilla.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account
Please try again now; I re-added your access to level1. Your access was removed -- I'll need to check if there is automation that removed it or if someone manually did so.
We talked in Matrix and I can log in to the machines now. :aki, we are having problems reproducing the problem since the rollback; could you please redeploy ci-configuration in staging?
Assignee
Comment 24•5 years ago
I'm reducing the severity on this bug since it is no longer a "blocker" for ci.
Updated•5 years ago
Assignee
Comment 25•5 years ago
Adding some of the discussion yesterday from debugging the problem in #vcs in chat.m.o:
dhouse
sheehan: what do you think at this point? With your test on a separate gcp instance showing good performance, can we determine the problem is with the instances image or configuration and not the pipe to hg.m.o (or cache/other?)?
sheehan
@dh
dhouse:
ugh. trying to select someone's name to ping them requires the mouse apparently. sigh.
I don't know at this point. the tasks are all failing to download a file from a GCS bucket in the same region when they hit these slowness/no logs problems. I doubt it's an issue with instances downloading files from within google
dhouse
walac: since these are new, can you switch the tasks back to all on aws until this problem is resolved? (and this can be worked through in staging?)
sheehan
dhouse: fwiw we previously had a subset of builds running in GCP, and they didn't have this issue AFAIK. is there any difference between the workers from that subset and the new workers?
Comment 26•5 years ago
(In reply to Wander Lairson Costa from comment #23)
We talked in Matrix and I can log in to the machines now. :aki, we are having problems reproducing the problem since the rollback; could you please redeploy ci-configuration in staging?
Done.
Updated•5 years ago
Updated•5 years ago
Comment 27•5 years ago
I dealt with switching us back, but we probably need a different owner to fix the underlying issue.
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 29•5 years ago
Wander, do you have more information on the problem? (Maybe notes on what may need to be done, or links to the PRs and bugs?) I don't have a complete view of what happened before I joined to check the disk-io/network issues for this bug. Or if this is being worked on in another bug, could you close this out and link to there?
Comment 30•5 years ago
(In reply to Dave House [:dhouse] from comment #29)
Wander, do you have more information on the problem? (Maybe notes on what may need to be done, or links to the PRs and bugs?) I don't have a complete view of what happened before I joined to check the disk-io/network issues for this bug. Or if this is being worked on in another bug, could you close this out and link to there?
afaik, the tracking bug for the vcs issue is this one.
It is worth mentioning the patches that switched from aws to gcp:
ci-configuration: https://phabricator.services.mozilla.com/D60042
mozilla-central: https://phabricator.services.mozilla.com/D60042
Updated•5 years ago
Assignee
Comment 31•5 years ago
Connor, are there pieces that need to be set up for caching/bundles for workers in gcp? (bugs that this bug can depend on) Or is that ready and do we just need to test this (in staging) at scale again?
Comment 32•5 years ago
The caching/bundles are already set up, the only assumption is that all the tasks are taking place in us-central1.
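One quick way to confirm that from a worker or any shell (a sketch; ?cmd=clonebundles returns the manifest robustcheckout consults, and the bucket name matches the one in the task logs):
# Show only the GCP us-central1 bundle entries advertised for mozilla-unified.
$ curl -s 'https://hg.mozilla.org/mozilla-unified?cmd=clonebundles' | grep moz-hg-bundles-gcp-us-central1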
Assignee
Comment 33•5 years ago
Hi Wander, did you get a chance to test this in staging? Or is there some setup that needs to be done (that I can help with) to stage builds in gcp?
Assignee
Comment 34•5 years ago
(In reply to Dave House [:dhouse] from comment #33)
Hi Wander, did you get a chance to test this in staging? Or is there some setup that needs to be done (that I can help with) to stage builds in gcp?
If we can get the builds running in staging, maybe we can try to reproduce the timeouts there with scale matching what we had in production.
In case we cannot reproduce it, do you know how we can configure the "shadow builds" (I'm not sure what they're called -- since the aws builds are prod/release, and there are shadow/unused builds running on gcp?) to run with the new image?
Comment 35•5 years ago
(In reply to Dave House [:dhouse] from comment #34)
If we can get the builds running in staging, maybe we can try to reproduce the timeouts there with scale matching what we had in production.
In case we cannot reproduce it, do you know how we can configure the "shadow builds" (I'm not sure what they're called -- since the aws builds are prod/release, and there are shadow/unused builds running on gcp?) to run with the new image?
Kendall: what's the expectation on your side? Should Wander be running further tests?
The larger question is whether staging performance is an adequate proxy for production. We weren't hitting these timeouts with only tier 3 load, and I don't know if we are equipped to simulate a full production workload in staging. We should figure that out if we want staging results to matter.
Assignee
Comment 36•5 years ago
Wander, I will do the testing and I may have questions for you. Here are my notes so far:
At the failure time, configs were at:
https://hg.mozilla.org/ci/ci-configuration/file/b44ac91dc9706738c4c9d01bcf3d530651e6fb5a/worker-pools.yml#l103
https://hg.mozilla.org/mozilla-central/file/aa01bf29eae7993e8bfb147c16a789a16bc45617/taskcluster/ci/build
Current state looking at the worker+pool configurations at https://hg.mozilla.org/ci/ci-configuration/file/c5027a4f096d2da89bff62e27e64dd4bff4258c1/worker-pools.yml#l86 (current; after https://phabricator.services.mozilla.com/D62672) and the task to pool mappings at https://hg.mozilla.org/mozilla-central/file/b7a8b4961e01b8a38be141bb89cc93f1ad404e4e/taskcluster/ci/build (after https://phabricator.services.mozilla.com/rMOZILLACENTRAL7dcbf3debe6a3a527df82c422446ca42a3de78c4):
The instance image and type and pools are still in gcp but without the tasks routed to them.
The tier 3 validation builds ("shadow") are in gcp; tier 1 builds are in aws.
The change flipping aws with gcp was:
worker+pool config: ci-configuration:/worker-pools.yml (names used in m-c ymls like "workerType: b-linux-aws" s/-gcp//g, "provider_id: aws")
task to pool mapping: m-c:/taskcluster/ci/build/{android,linux,macosx}.yml (s/gcp/aws/g)
The "revert" kept the pools in gcp (flipped the tier1/3 task mapping in m-c and restored the aws pools in ci-configuration).
I need to collect data on where the failures happened (worker type, confirming pools and instance type/specs), and the infra config (cache, region, ?) during failures and now. Then I can look at reproducing with the "prod" but-turned-off gcp pools as "staging" since they match (they are) production.
Comment 37•5 years ago
(In reply to Dave House [:dhouse] from comment #36)
Wander, I will do the testing and I may have questions for you. Here are my notes so far:
At the failure time, configs were at:
https://hg.mozilla.org/ci/ci-configuration/file/b44ac91dc9706738c4c9d01bcf3d530651e6fb5a/worker-pools.yml#l103
https://hg.mozilla.org/mozilla-central/file/aa01bf29eae7993e8bfb147c16a789a16bc45617/taskcluster/ci/build
Current state looking at the worker+pool configurations at https://hg.mozilla.org/ci/ci-configuration/file/c5027a4f096d2da89bff62e27e64dd4bff4258c1/worker-pools.yml#l86 (current; after https://phabricator.services.mozilla.com/D62672) and the task to pool mappings at https://hg.mozilla.org/mozilla-central/file/b7a8b4961e01b8a38be141bb89cc93f1ad404e4e/taskcluster/ci/build (after https://phabricator.services.mozilla.com/rMOZILLACENTRAL7dcbf3debe6a3a527df82c422446ca42a3de78c4):
The instance image and type and pools are still in gcp but without the tasks routed to them.
The tier 3 validation builds ("shadow") are in gcp; tier 1 builds are in aws.
The change flipping aws with gcp was:
worker+pool config: ci-configuration:/worker-pools.yml (names used in m-c ymls like "workerType: b-linux-aws" s/-gcp//g, "provider_id: aws")
task to pool mapping: m-c:/taskcluster/ci/build/{android,linux,macosx}.yml (s/gcp/aws/g)
The "revert" kept the pools in gcp (flipped the tier1/3 task mapping in m-c and restored the aws pools in ci-configuration).
I need to collect data on where the failures happened (worker type, confirming pools and instance type/specs), and the infra config (cache, region, ?) during failures and now. Then I can look at reproducing with the "prod" but-turned-off gcp pools as "staging" since they match (they are) production.
Your description is pretty accurate. I believe the change also includes deleting the current b-linux worker pool in ci-configuration, as it is configured for AWS. 302 to :tomprince to see if he sees something wrong.
Assignee
Comment 38•5 years ago
(In reply to Wander Lairson Costa from comment #37)
Your description is pretty accurate. I believe the change also includes deleting the current b-linux worker pool in ci-configuration, as it is configured for AWS. 302 to :tomprince to see if he sees something wrong.
Thank you for confirming!
Assignee
Comment 39•5 years ago
Reviewing the failures from comment 0. It looks like they stopped logging mid-pull and then, long after, the task timed out. Possibly, the log was cut off or something hung.
Two pulled from cached bundles, https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified, and the other did not reach the pull step (expecting it did or would have used the bundle but I think the log is interrupted).
build-linux64-tsan-fuzzing/opt
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288433462&repo=autoland&lineNumber=102
[taskcluster 2020-02-12 00:03:08.515Z] Worker Group: fxci-level3-gcp
[taskcluster 2020-02-12 00:03:08.515Z] Worker Node Type: projects/970387039909/machineTypes/custom-32-36864
[taskcluster 2020-02-12 00:03:08.515Z] Worker Type: b-linux
[...]
[vcs 2020-02-12T00:03:38.588Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2020-02-12T00:03:38.690Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2020-02-12T00:03:38.691Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2020-02-12T00:03:38.770Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', 'b4dc366c916164498a62980e867c0cf5647e6d41', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2020-02-12T00:03:38.865Z] (using Mercurial 4.8.1)
[vcs 2020-02-12T00:03:38.865Z] ensuring https://hg.mozilla.org/integration/autoland@b4dc366c916164498a62980e867c0cf5647e6d41 is available at /builds/worker/workspace/build/src
[vcs 2020-02-12T00:03:39.446Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-02-12T00:03:39.780Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-02-12T00:03:40.225Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/b8ad0267e88abff0a54a6f0f1b5de2ad3a018460.stream-v2.hg
[vcs 2020-02-12T00:03:40.296Z] 561909 files to transfer, 3.19 GB of data
[...]
[vcs 2020-02-12T00:04:46.164Z] clone [=====================> ] 1787855560/3426007566 1m00s
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-02-12 01:33:36.083Z] === Task Finished ===
re-ran pull time (local cache, "sharing from existing pooled repository 8ba995b..."):
02:49:03 to 02:49:33
artifact-build-linux64-artifact/opt
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288437926&repo=autoland&lineNumber=36
[taskcluster 2020-02-11 23:59:28.653Z] Worker Group: fxci-level3-gcp
[taskcluster 2020-02-11 23:59:28.653Z] Worker Node Type: projects/970387039909/machineTypes/custom-32-36864
[taskcluster 2020-02-11 23:59:28.653Z] Worker Type: b-linux
[...]
[vcs 2020-02-12T00:35:23.592Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2020-02-12T00:35:23.705Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2020-02-12T00:35:23.705Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2020-02-12T00:35:23.793Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', 'b4dc366c916164498a62980e867c0cf5647e6d41', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2020-02-12T00:35:23.880Z] (using Mercurial 4.8.1)
[vcs 2020-02-12T00:35:23.880Z] ensuring https://hg.mozilla.org/integration/autoland@b4dc366c916164498a62980e867c0cf5647e6d41 is available at /builds/worker/workspace/build/src
[vcs 2020-02-12T00:35:24.547Z] (shared store has an active lock; assuming it is left over from a previous process and that the store is corrupt; deleting store and destination just to be sure)
[vcs 2020-02-12T00:35:33.303Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2020-02-12 01:35:21.169Z] === Task Finished ===
re-ran pull time:
[vcs 2020-02-12T03:53:35.560Z] transferred 3.19 GB in 98.0 seconds (33.3 MB/sec)
build-macosx64/debug
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288432758&repo=autoland&lineNumber=41
[taskcluster 2020-02-11 23:59:28.653Z] Worker Group: fxci-level3-gcp
[taskcluster 2020-02-11 23:59:28.653Z] Worker Node Type: projects/970387039909/machineTypes/custom-32-36864
[taskcluster 2020-02-11 23:59:28.653Z] Worker Type: b-linux
[...]
[vcs 2020-02-11T23:59:53.801Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2020-02-11T23:59:53.925Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2020-02-11T23:59:53.925Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2020-02-11T23:59:54.004Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '02f22447369053837e8bc80c436224a8eb62bf16', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2020-02-11T23:59:54.100Z] (using Mercurial 4.8.1)
[vcs 2020-02-11T23:59:54.101Z] ensuring https://hg.mozilla.org/integration/autoland@02f22447369053837e8bc80c436224a8eb62bf16 is available at /builds/worker/workspace/build/src
[vcs 2020-02-11T23:59:54.557Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-02-11T23:59:54.709Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-02-11T23:59:55.099Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/b8ad0267e88abff0a54a6f0f1b5de2ad3a018460.stream-v2.hg
[vcs 2020-02-11T23:59:55.270Z] 561909 files to transfer, 3.19 GB of data
[vcs 2020-02-11T23:59:57.355Z]
[vcs 2020-02-11T23:59:58.355Z] clone [ ] 56359221/3426007566 2m05s
[vcs 2020-02-11T23:59:59.394Z] clone [> ] 87628124/3426007566 1m58s
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-02-12 01:29:51.256Z] === Task Finished ===
not re-run
Comment 40•5 years ago
(In reply to Dave House [:dhouse] from comment #39)
Reviewing the failures from comment 0. It looks like they stopped logging mid-pull and then, long after, the task timed-out. Possibly, the log was cut off or something hung.
this reminded me of something glob mentioned: if you pull a disk out from under hg while it's trying to update/clone, it'll simply hang for a long time with no output, not unlike what happens when an NFS mount goes away. so their thought was that the issue may be due to underlying storage problems/config/etc.
(In reply to Chris Cooper [:coop] pronoun: he from comment #35)
Kendall: what's the expectation on your side? Should Wander be running further tests?
Nah, Dave's driving this. If Wander can be available to answer questions, then I think we're good.
The larger question is whether staging performance is an adequate proxy for production. We weren't hitting these timeouts with only tier 3 load, and I don't know if we are equipped to simulate a full production workload in staging. We should figure that out if we want staging results to matter.
Agreed. This will not be fun if it only appears under full load.
Assignee
Comment 41•5 years ago
(In reply to Kendall Libby [:fubar] (he/him) from comment #40)
(In reply to Dave House [:dhouse] from comment #39)
Reviewing the failures from comment 0. It looks like they stopped logging mid-pull and then, long after, the task timed-out. Possibly, the log was cut off or something hung.
this reminded me of something glob mentioned: if you pull a disk out from under hg while it's trying to update/clone, it'll simply hang for a long time with no output, not unlike what happens when an NFS mount goes away. so their thought was that the issue may be due to underlying storage problems/config/etc.
Thank you! I'll keep this in mind and may put in some debug logging to try to catch this kind of hang and monitor the disk state.
Assignee
Comment 42•5 years ago
Reviewing the timed-out tasks from https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358, they mostly hit the same failures as the 3 examples in comment 0; interrupted clone, or "cloning" message and then nothing.
Examples (last log line before the "Task timeout" message):
~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep clone
1.android-4-0-armv7-api16.debug.7140686373634092707.1581453475.1581460703.live_backing.log-[vcs 2020-02-11T20:39:19.401Z] clone [========================> ] 1945217846/3426007566 39s
1.android-4-2-x86.opt.5135040868078535476.1581448740.1581455970.live_backing.log-[vcs 2020-02-11T19:20:38.873Z] clone [==================> ] 1571017784/3426007566 1m16s
1.android-5-0-aarch64.debug.2221161125476516910.1581448079.1581455318.live_backing.log-[vcs 2020-02-11T19:09:35.612Z] clone [====================> ] 1669120405/3426007566 55s
1.android-5-0-aarch64.opt.9218729948170072210.1581453025.1581460266.live_backing.log-[vcs 2020-02-11T20:31:59.359Z] clone [=================> ] 1504703846/3426007566 1m02s
1.android-5-0-x86_64.opt.7208775679883523062.1581452652.1581459886.live_backing.log-[vcs 2020-02-11T20:25:40.188Z] clone [=============> ] 1179549325/3426007566 1m32s
1.linux64-devedition.opt.4048035201520608580.1581521897.1581527322.live_backing.log-[vcs 2020-02-12T15:39:32.167Z] clone [====================> ] 1668359888/3427138758 49s
1.linux64.asan.7155597052681846487.1581465741.1581471165.live_backing.log-[vcs 2020-02-12T00:03:41.011Z] clone [=======================> ] 1863318163/3426007566 43s
1.linux64.asan.8212565173965073315.1581454747.1581460175.live_backing.log-[vcs 2020-02-11T21:00:44.736Z] clone [==================> ] 1581098989/3426007566 1m17s
1.linux64.debug.2636751713044725744.1581465972.1581468696.live_backing.log-[vcs 2020-02-12T00:07:34.630Z] clone [======================> ] 1777894088/3426007566 51s
1.linux64.debug.5510109461410872770.1581532006.1581534730.live_backing.log-[vcs 2020-02-12T18:28:08.704Z] clone [=========================> ] 2050883465/3427138758 37s
1.linux64.debug.8963396912676798448.1581465974.1581468698.live_backing.log-[vcs 2020-02-12T00:07:32.458Z] clone [=================> ] 1512134115/3426007566 1m03s
1.linux64.opt.4779619533921929029.1581447680.1581454911.live_backing.log-[vcs 2020-02-11T19:02:51.421Z] clone [======================> ] 1779466952/3426007566 53s
1.linux64.opt.8784018350883684421.1581448955.1581456179.live_backing.log-[vcs 2020-02-11T19:24:10.149Z] clone [==================> ] 1592507805/3426007566 1m17s
1.linux64.tsan.4992956426439574725.1581447809.1581453233.live_backing.log-[vcs 2020-02-11T19:05:00.361Z] clone [========================> ] 1907023313/3426007566 50s
1.linux64.tsan.6699715376909844223.1581465676.1581471106.live_backing.log-[vcs 2020-02-12T00:02:41.787Z] clone [====================> ] 1659055240/3426007566 55s
1.linux64.tsan.740457908796325518.1581465787.1581471217.live_backing.log-[vcs 2020-02-12T00:04:46.164Z] clone [=====================> ] 1787855560/3426007566 1m00s
1.linux64.tsan.956158274572320471.1581465746.1581471170.live_backing.log-[vcs 2020-02-12T00:03:55.873Z] clone [====================> ] 1714493512/3426007566 1m02s
1.osx-cross.debug.1341284457246523837.1581465568.1581470992.live_backing.log-[vcs 2020-02-11T23:59:59.394Z] clone [> ] 87628124/3426007566 1m58s
1.osx-shippable.opt.6899541328811475486.1581448530.1581452155.live_backing.log-[vcs 2020-02-11T19:17:09.471Z] clone [===========================> ] 2178740183/3426007566 41s
1.toolchains.opt.8075127049421755574.1581460131.1581463760.live_backing.log-[vcs 2020-02-11T22:30:06.591Z] clone [================> ] 1388857754/3426007566 1m02s
1.windows2012-32-devedition.opt.2258701274242243645.1581507348.1581514571.live_backing.log-[vcs 2020-02-12T11:37:08.555Z] clone [=========================> ] 2038976193/3427138758 38s
1.windows2012-32-devedition.opt.935603359957235370.1581516930.1581522354.live_backing.log-[vcs 2020-02-12T14:16:42.283Z] clone [=======================> ] 1833288050/3427138758 39s
1.windows2012-32-shippable.opt.2203446541454065495.1581449745.1581456971.live_backing.log-[vcs 2020-02-11T19:37:12.615Z] clone [=======================> ] 1889987992/3426007566 47s
1.windows2012-aarch64-shippable.opt.6611326461786301728.1581521950.1581529174.live_backing.log-[vcs 2020-02-12T15:40:47.577Z] clone [=======================> ] 1844440000/3427138758 59s
2.linux64-aarch64.opt.6142367542606216485.1581450267.1581457507.live_backing.log-[vcs 2020-02-11T19:45:51.924Z] clone [===================> ] 1537001821/3426007566 48s
2.linux64.debug.4320382539503960048.1581465975.1581471428.live_backing.log-[vcs 2020-02-12T00:07:34.119Z] clone [==========> ] 883863620/3426007566 58s
2.linux64.opt.4304945460674466998.1581460133.1581463761.live_backing.log-[vcs 2020-02-11T22:30:22.245Z] clone [========================> ] 1979023452/3426007566 42s
2.windows-mingw32.all.4522614149832682231.1581454456.1581461680.live_backing.log-[vcs 2020-02-11T20:55:32.809Z] clone [====================> ] 1607309398/3426007566 55s
~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep cloning
1.android-4-0-armv7-api16.debug.6899541328811475486.1581452158.1581459387.live_backing.log-[vcs 2020-02-11T20:16:43.503Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-0-armv7-api16.pgo.4992956426439574725.1581453239.1581460477.live_backing.log-[vcs 2020-02-11T20:34:50.492Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-0-armv7-api16.pgo.8730667323694669983.1581524945.1581532169.live_backing.log-[vcs 2020-02-12T16:29:32.061Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-0-geckoview-fat-aar.opt.7120138743283557661.1581534867.1581542089.live_backing.log-[vcs 2020-02-12T19:14:52.361Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-2-x86.debug.8140242803783263007.1581524946.1581532169.live_backing.log-[vcs 2020-02-12T16:29:32.720Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-2-x86.opt.2913941764092885856.1581524946.1581532171.live_backing.log-[vcs 2020-02-12T16:29:34.809Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-5-0-x86_64.debug.2042436425209502781.1581524945.1581532169.live_backing.log-[vcs 2020-02-12T16:29:33.136Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.1136869659206653110.1581467649.1581471266.live_backing.log-[vcs 2020-02-12T00:34:34.699Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.2707280311610152118.1581468012.1581471632.live_backing.log-[vcs 2020-02-12T00:40:45.572Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.3261117199127843984.1581467802.1581471422.live_backing.log-[vcs 2020-02-12T00:37:13.072Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.6474424353456821430.1581467802.1581471405.live_backing.log-[vcs 2020-02-12T00:36:46.062Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32.debug.4304945460674466998.1581464407.1581469829.live_backing.log-[vcs 2020-02-11T23:40:42.224Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64-shippable.opt.1087673278437182443.1581487118.1581490738.live_backing.log-[vcs 2020-02-12T05:59:11.115Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64-shippable.opt.3528686596447447834.1581468580.1581472197.live_backing.log-[vcs 2020-02-12T00:49:59.803Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.asan.7999058355361287261.1581528590.1581534008.live_backing.log-[vcs 2020-02-12T17:30:10.428Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.5474995518196054417.1581528806.1581534223.live_backing.log-[vcs 2020-02-12T17:33:45.525Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.6474424353456821430.1581462018.1581467436.live_backing.log-[vcs 2020-02-11T23:00:49.183Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.6911032307653124192.1581528426.1581533829.live_backing.log-[vcs 2020-02-12T17:27:19.654Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.7080638530357990544.1581465591.1581468294.live_backing.log-[vcs 2020-02-12T00:00:03.008Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.7822863695774066644.1581528638.1581534058.live_backing.log-[vcs 2020-02-12T17:31:10.087Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross-devedition.opt.2893510033435132656.1581504717.1581508336.live_backing.log-[vcs 2020-02-12T10:52:28.703Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross-devedition.opt.5135040868078535476.1581455973.1581463192.live_backing.log-[vcs 2020-02-11T21:20:03.480Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross-devedition.opt.6049069488453583567.1581504716.1581508333.live_backing.log-[vcs 2020-02-12T10:52:22.036Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross.debug.8637672537182855361.1581513689.1581519107.live_backing.log-[vcs 2020-02-12T13:21:58.447Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.2743940490296478580.1581507154.1581510772.live_backing.log-[vcs 2020-02-12T11:32:53.900Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.3725644940419437750.1581467841.1581471464.live_backing.log-[vcs 2020-02-12T00:37:52.959Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.4632915189236143286.1581467841.1581471462.live_backing.log-[vcs 2020-02-12T00:37:46.983Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.7155074290737386678.1581508365.1581511968.live_backing.log-[vcs 2020-02-12T11:52:48.995Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.8944846970400183478.1581467841.1581472884.live_backing.log-[vcs 2020-02-12T01:01:36.837Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-32-shippable.opt.8598930111106808623.1581507178.1581514396.live_backing.log-[vcs 2020-02-12T11:33:21.566Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-32-shippable.opt.8911141639019205814.1581507161.1581514363.live_backing.log-[vcs 2020-02-12T11:32:43.967Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-64-shippable.opt.7080638530357990544.1581468318.1581475520.live_backing.log-[vcs 2020-02-12T00:45:20.949Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-aarch64-devedition.opt.6464653165419460976.1581507671.1581514888.live_backing.log-[vcs 2020-02-12T11:41:39.748Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-aarch64-shippable.opt.3160669768184135721.1581471325.1581478952.live_backing.log-[vcs 2020-02-12T01:42:37.435Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
2.linux64.opt.2413071487011296400.1581467704.1581471322.live_backing.log-[vcs 2020-02-12T00:35:33.303Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
2.linux64.opt.8555794478947795067.1581480514.1581484117.live_backing.log-[vcs 2020-02-12T04:08:37.521Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
However, some of those are on the Windows platform.
Misc others in the group have task or other actions before the timeout:
~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep -Ev '(-$|Task timeout|clon)'
1.android-em-7-0-x86_64.debug.machine-39.1581435343.1581438050.live_backing.log-[task 2020-02-11T16:19:23.183Z] 16:19:23 INFO - Using adb 1.0.41
1.android-em-7-0-x86_64.opt.machine-22.1581314833.1581317537.live_backing.log-[task 2020-02-10T06:50:58.092Z] 06:50:58 INFO - Using adb 1.0.41
1.linux32-shippable.opt.3997768055339783382.1581467968.1581471570.live_backing.log-[fetches 2020-02-12T00:39:39.296Z] Removing /builds/worker/fetches/upx.tar.xz
1.linux32.debug.i-0c6ece209a02cea52.1581458517.1581463958.live_backing.log-[task 2020-02-11T23:32:33.894Z] 23:32:33 INFO - package-tests> make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.linux32.opt.i-0886d3918088c8ecc.1581356381.1581359139.live_backing.log-[task 2020-02-10T18:20:12.245Z] 18:20:12 INFO - make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/testing/geckodriver'
1.linux64.debug.i-0ada66834e3727eff.1581358813.1581364238.live_backing.log-[task 2020-02-10T19:49:06.469Z] 19:49:06 INFO - package-tests> make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.linux64.debug.i-0dacdcbc862638ee4.1581425506.1581430922.live_backing.log-[task 2020-02-11T14:21:46.248Z] 14:21:46 INFO - make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.linux64.debug.i-0f9befaaf5ea438ad.1581376730.1581382150.live_backing.log-[task 2020-02-11T00:48:57.749Z] 00:48:57 INFO - buildsymbols> make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.osx-cross.debug.i-07e77e82213a1381a.1581369251.1581374711.live_backing.log-[task 2020-02-10T22:24:01.341Z] 22:24:01 INFO - make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/ipc/app'
2.linux1804-64-asan.opt.i-0caf4f36ae11f20f5.1581377613.1581384816.live_backing.log-[task 2020-02-11T01:33:33.805Z] 01:33:33 INFO - Closing window 39
2.linux1804-64-asan.opt.i-0d193853be308d9ae.1581528813.1581536241.live_backing.log-[task 2020-02-12T19:37:09.790Z] 19:37:09 INFO - PID 10837 | 1581536229785 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
2.linux1804-64-asan.opt.i-0fb1a4be8fdf74831.1581419024.1581426659.live_backing.log-[task 2020-02-11T13:10:57.649Z] 13:10:57 INFO - TEST-START | /_mozilla/dom/classList.html
2.linux1804-64-ccov.opt.i-06c2bcc4d2abf27f8.1581515486.1581520888.live_backing.log-[task 2020-02-12T15:20:03.491Z] 15:20:03 INFO - TEST-START | dom/push/test/xpcshell/test_clear_forgetAboutSite.js
2.symbols.opt.1142744006236503768.1581465773.1581469386.live_backing.log-aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host msdl.microsoft.com:443 ssl:default [None]
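To get a rough per-platform breakdown of the same set, the grep output can be tallied on the platform component of the log file names (an assumption about the file-name layout, which matches the listings above):
~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep -E 'clon(e|ing)' | cut -d. -f2 | sort | uniq -c | sort -rn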
Assignee
Comment 43•5 years ago
(In reply to Dave House [:dhouse] from comment #42)
Reviewing the timed-out tasks from https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358, they mostly hit the same failures as the 3 examples in comment 0; interrupted clone, or "cloning" message and then nothing.
Nothing appears significant to me from these listings and counts across worker types (various workers, tasks, and times).
Since the logs show the same interruptions on windows workers, I wonder if it is a problem happening there also (and if we are running builds there, then maybe there are more tasks to review).
Assignee
Comment 44•5 years ago
(In reply to Wander Lairson Costa from comment #37)
Your description is pretty accurate. I believe the change also includes deleting the current b-linux worker pool in ci-configuration, as it is configured for AWS. 302 to :tomprince to see if he sees something wrong.
:tomprince, could you guide me on how to send tasks through for only the gecko-1/b-linux-gcp workerType? Can I locally commit a change to taskcluster/ci/build/linux.yml (s/b-linux/b-linux-gcp/g) and mach try with some build tasks, or is there a better, or actual, way to use linux-gcp/debug (https://hg.mozilla.org/mozilla-central/file/b7a8b4961e01b8a38be141bb89cc93f1ad404e4e/taskcluster/ci/build/linux.yml#l1350)?
Comment 45•5 years ago
You can select the -gcp builds from mach try fuzzy --full. Or, you can pass --worker-suffix b-linux=-gcp to most mach try selectors, to run everything that would run on gecko-1/b-linux on gecko-1/b-linux-gcp instead; doing the latter won't select particular tasks, but any that are selected will run on that worker.
There is also --worker-override <alias>=<worker-pool> if you need more control, but most uses can use the suffix version.
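For example, a try push that routes the Linux build tasks to the GCP pool might look like this (a sketch; the fuzzy query is only illustrative, the --worker-suffix flag is the one described above):
$ ./mach try fuzzy --full --worker-suffix b-linux=-gcp -q "'build-linux64"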
Assignee
Comment 46•5 years ago
I kicked off ~400 builds. treeherder shows them at 87% complete, and no failures. I see 329 active gce instances. https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c3cd02d3533602e6a08b200973e335065f27fa5
If there are failures in this group, then it will be simpler to track down. If there are none, then I think it may be a higher load or specific timing issue, and I can try a burst of tasks to reach the failure level and review timing (what else is happening).
I also want to expand my earlier search on timeouts to see if there are timeouts with the same log-cut-off behavior that I expect is a symptom of the problem. If it has been happening on regular tier3 builds, and on windows tasks, then it is not likely a problem specific to the linux workers.
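The active-instance count can be cross-checked from the CLI (a sketch; project and instance-name prefix are the ones used earlier in this bug):
$ gcloud compute instances list --project=fxci-production-level1-workers \
    --filter="name~'^gecko-1-b-linux-gcp' AND status=RUNNING" \
    --format="value(name)" | wc -l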
Assignee
Comment 47•5 years ago
(In reply to Dave House [:dhouse] from comment #46)
I kicked off ~400 builds. treeherder shows them at 87% complete, and no failures. I see 329 active gce instances. https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c3cd02d3533602e6a08b200973e335065f27fa5
One build failed but not with a timeout. Instead it failed with a connection reset message:
[vcs 2020-02-28T05:44:48.170Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '6c3cd02d3533602e6a08b200973e335065f27fa5', 'https://hg.mozilla.org/try', '/builds/worker/workspace/build/src']
[vcs 2020-02-28T05:44:48.255Z] (using Mercurial 4.8.1)
[vcs 2020-02-28T05:44:48.255Z] ensuring https://hg.mozilla.org/try@6c3cd02d3533602e6a08b200973e335065f27fa5 is available at /builds/worker/workspace/build/src
[vcs 2020-02-28T05:44:48.795Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-02-28T05:44:48.902Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-02-28T05:44:49.559Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/2c1dd534e937c92b4c5965892212d52d9b8a58f8.stream-v2.hg
[vcs 2020-02-28T05:44:49.628Z] 564092 files to transfer, 3.21 GB of data
[vcs 2020-02-28T05:44:51.630Z]
[vcs 2020-02-28T05:44:52.629Z] clone [=> ] 190404715/3449400926 35s
[vcs 2020-02-28T05:44:53.629Z] clone [=> ] 228307637/3449400926 43s
[vcs 2020-02-28T05:44:54.629Z] clone [==> ] 281860902/3449400926 45s
[vcs 2020-02-28T05:44:55.972Z] clone [===> ] 339920859/3449400926 46s
[vcs 2020-02-28T05:44:55.972Z]
[vcs 2020-02-28T05:44:55.972Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "clone_errored", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.1731109619140625}, {"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "overall", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.713670969009399}, {"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "overall_clone", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.713670969009399}, {"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "overall_clone_fullcheckout", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.713670969009399}]}
[vcs 2020-02-28T05:44:55.973Z] abort: Connection reset by peer
[taskcluster 2020-02-28 05:44:56.491Z] === Task Finished ===
[taskcluster 2020-02-28 05:44:56.702Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2020-02-28 05:44:56.892Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2020-02-28 05:44:57.107Z] Unsuccessful task run with exit code: 255 completed in 35.76 seconds
That looks very similar to the failures -- a clone log that does not show completion. The PERFHERDER_DATA entry and the following abort are different, though, and come immediately after a clone progress log entry. So it looks like a different problem (not hung/timeout but interrupted).
Assignee
Comment 48•5 years ago
Rob, have you seen hg clone timeouts on windows builders in GCP? There are Windows machines also in the failures Feb 10-12 when we had timeouts on linux builds: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358
But I did not find any in current papertrail logs. So something is likely different then from now (possibly the mass of linux workers not running). If you have seen this problem at other times, do you know about the cause or circumstances?
Assignee
Comment 49•5 years ago
Connor, do you know of mercurial debugging settings or logging that can show more details during a clone if the clone is stalling? I have not investigated what is available or read the mercurial source yet.
I've tried reproducing the failures (with a batch of 400 builds) and that did not give me a timeout/stall. So it looks like I will need to create a larger load; I'm manually setting up a load test of the clone on a small pool of instances to see if I can get a stall/failure that way without needing to spin up a high number of worker instances.
Assignee
Comment 50•5 years ago
(In reply to Dave House [:dhouse] from comment #49)
Connor, do you know of mercurial debugging settings or logging that can show more details during a clone if the clone is stalling? I have not investigated what is available or read the mercurial source yet.
I've tried reproducing the failures (with a batch of 400 builds) and that did not give me a timeout/stall. So it looks like I will need to create a larger load; I'm manually setting up a load test of the clone on a small pool of instances to see if I can get a stall/failure that way without needing to spin up a high number of worker instances.
These are running robustcheckout. I am testing with just hg --debug robustcheckout [...] (I still need to reproduce the problem, but I want to catch some debug logs or something when I get it reproduced).
Assignee
Comment 51•5 years ago
They are using an attached 375 GB disk:
Disk /dev/nvme0n1: 375 GiB, 402653184000 bytes, 98304000 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/mapper/instance_storage-all: 375 GiB, 402648989696 bytes, 98302976 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
$ sudo pvs
PV VG Fmt Attr PSize PFree
/dev/nvme0n1 instance_storage lvm2 a-- <375.00g 0
$ sudo vgdisplay
--- Volume group ---
VG Name instance_storage
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 2
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size <375.00 GiB
PE Size 4.00 MiB
Total PE 95999
Alloc PE / Size 95999 / <375.00 GiB
Free PE / Size 0 / 0
VG UUID tQ2n7Y-MgpA-YzgN-Wqpd-u4HT-qURe-SGkPBB
These instances are running with 10 GB disks and 95 GB memory.
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#performance.
This is only for the system boot disk. Poor I/O because of a small (shared) disk should not affect us because the clone is going into memory and the attached 375 GB disk.
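If storage does turn out to be suspect, a rough I/O sanity check on the attached volume could look like this (a sketch; the mount point is an assumption -- substitute wherever instance_storage-all is actually mounted on the worker):
MOUNT=/mnt/instance_storage   # assumption: mount point of /dev/mapper/instance_storage-all
# Sequential write that bypasses the page cache, so the device itself is measured.
dd if=/dev/zero of="$MOUNT/ddtest" bs=1M count=4096 oflag=direct conv=fsync
rm -f "$MOUNT/ddtest"
# Watch per-device utilization, e.g. while a clone is in progress.
iostat -x 1 nvme0n1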
Comment 52•5 years ago
(In reply to Dave House [:dhouse] from comment #49)
Connor, do you know of mercurial debugging settings or logging that can show more details during a clone if the clone is stalling? I have not investigated what is available or read the mercurial source yet.
I've tried reproducing the failures (with a batch of 400 builds) and that did not give me a timeout/stall. So it looks like I will need to create a larger load; I'm manually setting up a load test of the clone on a small pool of instances to see if I can get a stall/failure that way without needing to spin up a high number of worker instances.
You can use --traceback --verbose --debug on most commands, which should add debugging output and tell Mercurial to print the stack trace should it encounter an error. Those are global Mercurial flags so they should work with robustcheckout. You could also use --config ui.debug=true and see if that adds more output (though IIRC it is equivalent to --debug). There is also progress.debug=true for debugging the progress bar - I'm not sure how useful this will be but since we seem to be losing logs of the progress bar I thought it would be worth mentioning. YMMV.
There is also the extremely useful builtin blackbox extension which will create a log within the cloned repository (ie at repo/.hg/blackbox.log). If you run the clone command like hg --config extensions.blackbox= --verbose --traceback --debug robustcheckout ... all debugging output will make it to that log. You can see some of the other config options for blackbox by running hg --config extensions.blackbox= help -e blackbox.
Hope this helps. :)
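Combining those flags with the robustcheckout invocation from the task logs in comment 39, a debug run might look like this (a sketch; the hostsecurity fingerprint --config from the logs is omitted for brevity):
hg --config extensions.blackbox= --config progress.debug=true \
   --verbose --traceback --debug \
   robustcheckout \
   --sharebase /builds/worker/checkouts/hg-store \
   --purge \
   --upstream https://hg.mozilla.org/mozilla-unified \
   --revision b4dc366c916164498a62980e867c0cf5647e6d41 \
   https://hg.mozilla.org/integration/autoland \
   /builds/worker/workspace/build/src
# Afterwards, inspect the blackbox log inside the checkout:
#   /builds/worker/workspace/build/src/.hg/blackbox.log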
Assignee
Comment 53•5 years ago
(In reply to Cosmin Sabou [:CosminS] from comment #1)
This is blocking reopening autoland cause the failures rate is still high and there can't be test coverage this way.
:CosminS, can you tell me what the rate of failures was? I looked around but I don't know where to find that in treeherder/other; I can look at machine types or check tasks in activedata, but I'm wondering if you know an easier way to find the counts.
Assignee
Comment 54•5 years ago
(In reply to Connor Sheehan [:sheehan] from comment #52)
Hope this helps. :)
Thank you Connor! This gives me more of the information I was hoping to find, and I'm hoping (when I can reproduce it) that the progress debug and traceback catch something. I'm testing to see if I get a traceback from mercurial (with these options) when I issue a hangup, and if I get the same from a task-run timeout.
Reporter
Comment 55•5 years ago
(In reply to Dave House [:dhouse] from comment #53)
(In reply to Cosmin Sabou [:CosminS] from comment #1)
This is blocking reopening autoland cause the failures rate is still high and there can't be test coverage this way.
:CosminS, can you tell me what the rate of failures was? I looked around but I don't know where to find that in treeherder/other; I can look at machine types or check tasks in activedata, but I'm wondering if you know an easier way to find the counts.
Dave, I would gladly give you a percentage and exact numbers, but unfortunately I cannot do that. I can show you how it looked on the trees, especially on autoland, where there was a busy night with lots of pushes and failures/bustages overlapping.
This is what was starred as expected fail against this bug: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&tochange=39fd9ff27ad869a1d0168ac557b8c54d6a142515&fromchange=be916959142fc9ed315e855e2dce4b57fcac8e05&failure_classification_id=3
There were many more failures in the following hours, and this is only one range of pushes affected by this.
Maybe Aryx could help with a telemetry query for what was starred against this bug as an expected failure, to get more accurate numbers. Thank you.
Assignee | ||
Comment 56•5 years ago
|
||
(In reply to Cosmin Sabou [:CosminS] from comment #55)
(In reply to Dave House [:dhouse] from comment #53)
(In reply to Cosmin Sabou [:CosminS] from comment #1)
This is blocking reopening autoland cause the failures rate is still high and there can't be test coverage this way.
:CosminS, can you tell me what the rate of failures was? I looked around but I don't know where to find that in treeherder/other; I can look at machine types or check tasks in activedata, but I'm wondering if you know an easier way to find the counts.
Dave, I would gladly give you a percentage and exact numbers but unfortunately I cannot do that. I can show you how it looked on trees, especially on autoland where there were a busy night with lots of pushes and failures/bustages overlapping.
This is what was starred as expected fail against this bug: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&tochange=39fd9ff27ad869a1d0168ac557b8c54d6a142515&fromchange=be916959142fc9ed315e855e2dce4b57fcac8e05&failure_classification_id=3
There were many more failures to come in next hrs and this is only a range with pushes affected by this.
Maybe Aryx could help with a telemetry query for what was starred against this bug as expected fail and have more appropriate numbers. Thank you.
Thank you, this is helpful, even just to know I'm not overlooking an obvious way to get the numbers. I started working off of some of Aryx's queries in redash to track this down, and I think I have the data I need (timeout counts, thanks to treeherder's text_log_error, plus total counts).
I want to give Google support numbers for this, and it may also help me figure out how to reproduce the failure.
Comment 57•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #48)
Rob, have you seen hg clone timeouts on windows builders in GCP? There are Windows machines also in the failures Feb 10-12 when we had timeouts on linux builds: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358
But I did not find any in current papertrail logs. So something is likely different then from now (possibly the mass of linux workers not running). If you have seen this problem at other times, do you know about the cause or circumstances?
The Windows build log I checked makes it look like the instance was lost in the middle of the clone operation, because the instance just stops creating log output mid-clone. Later the queue resolves the task as a timeout.
||
Comment 58•5 years ago
|
||
List of tasks classified as this bug: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-07&endday=2020-03-06&tree=all&bug=1614852
Assignee | ||
Comment 59•5 years ago
|
||
(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #57)
(In reply to Dave House [:dhouse] from comment #48)
Rob, have you seen hg clone timeouts on windows builders in GCP? There are Windows machines also in the failures Feb 10-12 when we had timeouts on linux builds: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358
But I did not find any in current papertrail logs. So something is likely different then from now (possibly the mass of linux workers not running). If you have seen this problem at other times, do you know about the cause or circumstances?
The Windows build log I checked makes it look like the instance was lost in the middle of the clone operation, because the instance just stops creating log output mid-clone. Later the queue resolves the task as a timeout.
Thanks! That is similar to Linux: the task log just stops without any error (perhaps it is only the log upload that is terminated/killed). I had thought the log was resolved on the instance and that the task process just stops creating log output, BUT I need to verify that is what is happening and that it isn't the queue resolving them as timeouts.
Assignee | ||
Comment 60•5 years ago
|
||
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #58)
List of tasks classified as this bug: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-07&endday=2020-03-06&tree=all&bug=1614852
Thanks :Aryx! I'll review these (I was looking at bug 1411358 before). I am also collecting all tasks with timeouts from Feb 2-24+ and checking whether we had any repeats outside the problem window of Feb 10-12: https://sql.telemetry.mozilla.org/queries/68967/source#table with the few GCP workers we have there.
Assignee | ||
Comment 61•5 years ago
|
||
I've tested with the additional debugging options. I also made a minimal-depth walk through the Mercurial source to confirm the process against the logs (to review when/if the problem is reproduced). Example start of a clone with a GCS bundle:
(using Mercurial 4.8.1)
ensuring https://hg.mozilla.org/try@6c3cd02d3533602e6a08b200973e335065f27fa5 is available at /builds/worker/workspace/build/src
using https://hg.mozilla.org/mozilla-unified
sending capabilities command
hg.mozilla.org certificate matched fingerprint sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48
sending lookup command
sending 5 bytes
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
sending lookup command
sending 5 bytes
(sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
using https://hg.mozilla.org/mozilla-unified
sending capabilities command
hg.mozilla.org certificate matched fingerprint sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48
sending clonebundles command
applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/14caf2826f00fc609f7fe55d8df14e7e08fd1ced.stream-v2.hg
bundle2-input-bundle: with-transaction
bundle2-input-part: "stream2" (params: 3 mandatory) supported
applying stream bundle
564985 files to transfer, 3.22 GB of data
[...]
This is parsed as stream2 (https://www.mercurial-scm.org/repo/hg-stable/file/4.8.1/mercurial/bundle2.py#l499), which matches the bundle header:
$ curl -r 0-200 -o - "https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/14caf2826f00fc609f7fe55d8df14e7e08fd1ced.stream-v2.hg" | hexdump -C
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 201 100 201 0 0 862 0 --:--:-- --:--:-- --:--:-- 858
00000000 48 47 32 30 00 00 00 00 00 00 00 77 07 53 54 52 |HG20.......w.STR|
00000010 45 41 4d 32 00 00 00 00 03 00 09 0a 09 06 0c 35 |EAM2...........5|
00000020 62 79 74 65 63 6f 75 6e 74 33 34 35 39 34 30 32 |bytecount3459402|
00000030 33 31 33 66 69 6c 65 63 6f 75 6e 74 35 36 34 39 |313filecount5649|
00000040 38 35 72 65 71 75 69 72 65 6d 65 6e 74 73 64 6f |85requirementsdo|
00000050 74 65 6e 63 6f 64 65 25 32 43 66 6e 63 61 63 68 |tencode%2Cfncach|
00000060 65 25 32 43 67 65 6e 65 72 61 6c 64 65 6c 74 61 |e%2Cgeneraldelta|
00000070 25 32 43 72 65 76 6c 6f 67 76 31 25 32 43 73 74 |%2Crevlogv1%2Cst|
00000080 6f 72 65 00 00 80 00 73 11 c3 03 64 61 74 61 2f |ore....s...data/|
00000090 2e 61 72 63 63 6f 6e 66 69 67 2e 69 00 03 00 01 |.arcconfig.i....|
000000a0 00 00 00 00 00 00 00 44 00 00 00 46 00 00 00 00 |.......D...F....|
000000b0 00 06 b9 74 ff ff ff ff ff ff ff ff 78 81 e6 c2 |...t........x...|
000000c0 0d 7c 38 d6 37 a3 ab 7c 8b |.|8.7..|.|
000000c9
To consider later: hg 4.8.1 is from Nov 2018. If this is caused by a bug in Mercurial, it may already be fixed in the current stable release. We might also look at the versions of the extensions (such as robustcheckout) we're using.
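For reference, here is a rough Python sketch for decoding that header programmatically (the field layout is inferred from the hg 4.8.1 bundle2 code and the hexdump above, so treat it as illustrative only, and the bundle URL will eventually expire):
# Sketch: decode the HG20 header of a stream-v2 clone bundle, the same bytes
# fetched with curl above. Field layout inferred from mercurial/bundle2.py (4.8.1).
import struct
import urllib.request

URL = ("https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/"
       "mozilla-unified/14caf2826f00fc609f7fe55d8df14e7e08fd1ced.stream-v2.hg")

req = urllib.request.Request(URL, headers={"Range": "bytes=0-200"})
data = urllib.request.urlopen(req).read()

assert data[:4] == b"HG20"                       # bundle2 magic
pos = 4
(stream_params_len,) = struct.unpack(">I", data[pos:pos + 4]); pos += 4
pos += stream_params_len                         # 0 for these bundles

(part_header_len,) = struct.unpack(">I", data[pos:pos + 4]); pos += 4
header = data[pos:pos + part_header_len]

type_len = header[0]
part_type = header[1:1 + type_len].decode()      # expect "STREAM2"
off = 1 + type_len + 4                           # skip the 4-byte part id
mandatory, advisory = header[off], header[off + 1]
off += 2
sizes = [(header[off + 2 * i], header[off + 2 * i + 1])
         for i in range(mandatory + advisory)]
off += 2 * (mandatory + advisory)
params = {}
for ksize, vsize in sizes:
    key = header[off:off + ksize].decode(); off += ksize
    params[key] = header[off:off + vsize].decode(); off += vsize

# e.g. STREAM2 {'bytecount': '3459402313', 'filecount': '564985', 'requirements': ...}
print(part_type, params)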
Assignee | ||
Comment 62•5 years ago
|
||
A timed-out task is forcibly killed by docker-worker (https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L680 example: https://firefox-ci-tc.services.mozilla.com/tasks/NymXXRv4RrWSDbHGwV3CtQ/runs/0/logs/live/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FNymXXRv4RrWSDbHGwV3CtQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log). So there is no chance for Mercurial to log its traceback or handle the kill. To get a trace and logs, I need to trigger the timeout and hangup from within the task. The task is not given its timeout as a variable, but I can pick a reasonable time limit for the download, since the task timeouts shown in the failure logs are all 1800s+ (mostly 3600, 5400, 7200) and the bundle download takes 100-300s, mostly around 180s, in my testing.
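One way to do that is to enforce a shorter, in-task deadline on the clone itself so hg receives a catchable signal before docker-worker's SIGKILL arrives. A minimal sketch of the idea (my own illustration; the 600-second budget and the robustcheckout arguments are assumptions, not values from the real task definition):
# Sketch: run the clone with an in-task deadline so hg gets SIGTERM (and has a
# chance to unwind, flush the blackbox log, and honor --traceback) well before
# docker-worker force-kills the container. Command and budget are placeholders.
import signal
import subprocess

CLONE_CMD = [
    "hg", "--config", "extensions.blackbox=",
    "--traceback", "--verbose", "--debug",
    "robustcheckout",
    "--upstream", "https://hg.mozilla.org/mozilla-unified",
    "--revision", "REVISION",                             # placeholder revision
    "--sharebase", "/builds/worker/checkouts/hg-store",   # assumed path
    "https://hg.mozilla.org/integration/autoland",
    "/builds/worker/checkouts/gecko",
]
DEADLINE = 600  # seconds; generous next to the ~100-300s downloads seen in testing

proc = subprocess.Popen(CLONE_CMD)
try:
    proc.wait(timeout=DEADLINE)
except subprocess.TimeoutExpired:
    proc.send_signal(signal.SIGTERM)   # let hg unwind and write its logs
    try:
        proc.wait(timeout=30)
    except subprocess.TimeoutExpired:
        proc.kill()                    # only then fall back to SIGKILL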
Assignee | ||
Comment 63•5 years ago
|
||
more, re: the force kill:
(In reply to Dave House [:dhouse] from comment #62)
A timed-out task is forcibly killed by docker-worker (https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L680
In this call, we aren't checking the kill result or whether the container still existed; the function returns a promise that resolves with the result.
It uses dockerode, https://github.com/apocas/dockerode/blob/f5606ed233625bdd1a9a9be4458e439f9040aac1/lib/container.js#L637,
which uses the API call /containers/container_id/kill,
which issues SIGKILL by default: https://docs.docker.com/engine/api/v1.40/#operation/ContainerKill
This is also explained in the comment: https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L677
The run function waits on the container run to finish (it issues start and then wait): https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L929 - https://github.com/apocas/dockerode/blob/f5606ed233625bdd1a9a9be4458e439f9040aac1/lib/docker.js#L1464
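For comparison, a graceful variant of that kill would send SIGTERM and only escalate to SIGKILL after a grace period, which would let hg flush its logs. An illustrative sketch using the docker CLI (docker-worker itself goes through dockerode, so this is only to show the idea; docker stop --time 30 does essentially the same dance):
# Sketch: terminate a task container gracefully so the task (hg) can flush
# logs/tracebacks before the hard kill. Illustrative only.
import subprocess
import time

def stop_container(container_id: str, grace_seconds: int = 30) -> None:
    # Ask politely first; equivalent to POST /containers/{id}/kill?signal=SIGTERM
    subprocess.run(["docker", "kill", "--signal", "SIGTERM", container_id], check=False)
    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        probe = subprocess.run(
            ["docker", "inspect", "-f", "{{.State.Running}}", container_id],
            capture_output=True, text=True)
        if probe.stdout.strip() != "true":
            return                     # container exited on its own
        time.sleep(1)
    # Grace period expired: fall back to the default SIGKILL
    subprocess.run(["docker", "kill", container_id], check=False)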
Assignee | ||
Comment 64•5 years ago
|
||
I opened a case with Google support, 22377255. I've had three back-and-forth emails so far, and the details that may be relevant to the problem/cause so far are:
While inspecting internal resources from February the 10th through the 12th for project “fxci-production-level1-workers”, I see you had 247 Compute Engine VMs on February the 12th around 18:00 PST time consuming 953 mbps internal network traffic at the highest peak. As far as disk issues, nics, etc… I was not able to locate any issues other than the 953 mbps of internal network traffic between your two GCP projects.
After inspecting project “fxci-production-level3-workers” during this time, you had 704 Compute Engine VMs at your highest peak on February 12 around 01:00 PST in the us-central1 region 422 mbps and no disk problems, nics, etc… I also suspect high internal network traffic.
Following [1], there is some competition between network traffic and persistent disk traffic, so this may be the issue at hand. There potentially could have been some instability due to the high amount of usage. My next step will be to investigate this hunch to verify. You can expect an update no later than 6pm PST tomorrow.
[1] https://cloud.google.com/compute/docs/disks/performance#egress_performance_cap
GCP VM disk I/O is on the same network as the VM's NIC, and that disk I/O is capped based on VM size.
Reading this, I am curious whether it is possible to lock up a system with combined high network and disk I/O. They *should* just be throttled.
Re: the docker kill on timeout, we don't know the container state (only that we reached a timeout without the task ending, and the logs were truncated). I'm still thinking there may be something more there, rather than network/disk as the cause; it may be that some of these instances were hung or the logs were disconnected. Perhaps when a disk write is throttled, the clone stalls and does not resume when the disk is ready (on the GCP VM, or a bug with the LVM config or something).
Assignee | ||
Comment 65•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #64)
GCP VM disk i/o is on the same network as the vm's nic. And that disk i/o is capped based on vm size.
Reading this, I am curious if it is possible to lock up a system with combined high network and disk i/o. They *should just be throttled.
I'm fairly certain this is not our problem. The support agent asked about the attached disk, and on checking, it is local SSD, so I've asked whether that is throttled together with network (as total data in/out of the VM). I'm assuming it is not, and that the competition between network and disk applies to persistent disks only and not to local SSD (if I remember my GCE work from years ago, that was the case then), unless there is throttling for all VM data in/out.
I downloaded logs for all timed-out tasks from Jan 1st through Mar 11. The timeouts during cloning appear on GCP only from Feb 5th through the 13th. I pushed those numbers, grouped by tier 1/2/3, to influx to compare with GCP monitoring data: https://earthangel-b40313e5.influxcloud.net/d/9vTbVNXWk/gcp-hg-timeouts?orgId=1&from=1581453741293&to=1581594479200 That has shown me that we had more timeouts when we had more tasks running, but also that we had timeouts throughout that period, rising and falling only with the number of instances. So it does not look like the problem was caused by the scaling itself, i.e. the number of instances.
I've noticed the same timeouts in AWS (the logging matches: an hg clone stalled/interrupted and then, much later, the task is timed out and terminated). So I'm reviewing those to check when they happen, verifying they are real (?!), and seeing how frequent they are.
Assignee | ||
Comment 66•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #65)
I've noticed the same timeouts in aws (the logging matches with a hg clone stalled/interrupted and then much later the task is timed-out terminated). So I'm reviewing that to check when those happen, verifying they are real (?!), and how frequently.
The hg clone timeouts on tasks in AWS span from 2019-11-13 until today (that is when the treeherder DB started recording "text_log_error.line", so I don't have history before that date and can assume the occurrences also happened earlier).
4 happened yesterday on AWS. They don't hit with the same volume under scaling as we saw on GCP. That is weird if it is a regular intermittent: we should see it increase when we're scaled up in AWS the same as we saw in GCP (maybe we do, at a lower scale -- I'm checking the scaling numbers). BUT, I should also have seen it within my test of 400 tasks on GCP if it were simply a 1-in-N chance of hitting it. So it seems like the problem is in something that is not a function of the instance scaling -- and we saw it more when scaling because more instances were hitting it within the problem occurrence windows: A-B-C with B scaled up, showing the problem in C. The difference on AWS could be that the problematic component recovers more quickly there (maybe the GCS/S3 bucket or something like that).
I should be able to reproduce it in AWS because it is ongoing there; that, or I could catch an instance that hits the problem in AWS before it is terminated. However, reproducing it in AWS may point to a different cause than GCP; I can come back to this if I cannot reproduce it in GCP, since it may point to a relative cause (like S3 vs GCS).
I'm going to run a retrying hg clone for the next few days in GCP to see if I get a hit of the timeout. This will be a few instances running 24xN, so the cost is minimal compared with testing at the scale of 700+ instances.
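The retry loop I have in mind is roughly the following sketch (paths, revision, and the robustcheckout arguments are placeholders; it leans on the debug flags from comment 52 so that a stall leaves a blackbox log behind):
# Sketch of the repeated-clone harness: clone in a loop and keep the output of
# any run that fails or stalls past a deadline. Command and paths are placeholders.
import subprocess
import time

CMD = [
    "hg", "--config", "extensions.blackbox=",
    "--traceback", "--verbose", "--debug",
    "robustcheckout",
    "--upstream", "https://hg.mozilla.org/mozilla-unified",
    "--revision", "REVISION",          # placeholder revision
    "--sharebase", "/tmp/hg-share",    # assumed scratch location
    "https://hg.mozilla.org/integration/autoland",
    "/tmp/clone-test",
]
DEADLINE = 1800  # matches the smallest task timeout seen in the failure logs

for attempt in range(1, 1001):
    subprocess.run(["rm", "-rf", "/tmp/clone-test"])   # fresh destination each time
    started = time.time()
    with open(f"/tmp/clone-{attempt}.log", "w") as log:
        try:
            rc = subprocess.run(CMD, stdout=log, stderr=subprocess.STDOUT,
                                timeout=DEADLINE).returncode
        except subprocess.TimeoutExpired:
            rc = "timeout"   # note: subprocess.run SIGKILLs the child on timeout
    if rc != 0:
        print(f"attempt {attempt}: {rc} after {time.time() - started:.0f}s, "
              f"see /tmp/clone-{attempt}.log and /tmp/clone-test/.hg/blackbox.log")
        break
    time.sleep(5)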
Assignee | ||
Comment 67•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #66)
(In reply to Dave House [:dhouse] from comment #65)
I've noticed the same timeouts in aws (the logging matches with a hg clone stalled/interrupted and then much later the task is timed-out terminated). So I'm reviewing that to check when those happen, verifying they are real (?!), and how frequently.
The hg clone timeouts on tasks in aws span from 2019-11-13 (This is when the treeherder db started recording "text_log_error.line". so I don't have history beyond that date and can assume the occurrences continued before that date) until today.
Connor, what do you think could cause the Mercurial client to do this? (stopping with no logging; we don't get a trace because the container is killed/dropped). I think it may be something like network drops that the Mercurial client is not catching, so it reaches a bad state or keeps waiting.
I found we see the same problem on aws (back as far as I could track in treeherder's db, to Nov 13 '19), but not at the same volume as we hit in gcp. Here are some recent times for the timeout on aws workers (next log line for each was a task-kill timeout message):
logs/1.windows2012-32-shippable.opt.i-003e9237b981e7620.1579914066.1579921281
[vcs 2020-01-25T01:01:20.019Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
logs/1.android-5-0-x86_64.asan.i-003e9237b981e7620.1579921779.1579929001
[vcs 2020-01-25T03:09:59.160Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
logs/1.linux32-shippable.opt.i-0f4fa7f012c0d90ed.1580775450.1580779748
[vcs 2020-02-04T01:28:36.327Z] updating [============================================> ] 267700/285445 2m29s
logs/1.linux32-shippable.opt.i-0df4ed6a331e2ebfc.1580778092.1580779769
[vcs 2020-02-04T01:29:19.699Z] clone [===========================================> ] 3398433941/3419233525 07s
logs/1.linux64-shippable.opt.i-0df4ed6a331e2ebfc.1580780301.1580786330
[vcs 2020-02-04T01:52:42.438Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
logs/1.osx-shippable.opt.i-02139cfcb8d9fc133.1581079622.1581080483
[vcs 2020-02-07T12:57:23.451Z] clone [==========================================> ] 3323112934/3424341814 07s
logs/1.osx-shippable.opt.i-0c85a77090dcecc5b.1581087228.1581088610
[vcs 2020-02-07T15:16:43.655Z] clone [===========================================> ] 3419210782/3424341814 03s
Assignee | ||
Comment 68•5 years ago
|
||
shoot, I was looking at the last month. here is a March 6th example on aws:
logs/2.symbols.opt.i-06d078ecaec590c56.1583453056.1583454439
[vcs 2020-03-06T00:27:08.674Z] clone [==========================================> ] 3354366474/3457790664 35s
Comment 69•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #67)
Connor, what do you think could cause the mercurial client to do this? (stopping with no logging; we don't get a trace because the container is killed/dropped). I think it may be something like network drops that the mercurial client is not catching and reaches a bad state or keeps waiting.
I've got no idea, to be honest. Mercurial does have issues with flaky networks, but usually this results in an error message/traceback indicating what went wrong. We've had someone on our team (Mitch) running hg clones with a purposely bad environment attempting to replicate some of the issues we've been seeing (see https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c101 for his first attempts to try and reproduce a network failure).
If we're always losing the log output during the clone while downloading a clone bundle, the first place to look for issues in upstream Mercurial would be this function. You can see this function is called directly between the status messages "applying clone bundle from XXX" and "finished applying clone bundle" here. It may be worth trying to run some clones with --profile and seeing if anything useful comes of that.
It's interesting that we've hit similar errors in AWS as well. Can you provide some URLs to tasks in Treeherder or Taskcluster so I can take a closer look? It's odd that we don't see this issue as much in AWS (this is my first time hearing of it). Maybe there's some interaction going on here, like hg has issues with shoddy connections, and GCS is less reliable for downloads than S3, so we see the issue show up much more frequently in GCP.
Assignee | ||
Comment 70•5 years ago
|
||
other thread/topic, continuing to try to reproduce in gcp:
Reviewing the boot disk images, I see there is one updated on Feb12th.
The last timeout on GCP was Feb 12th at 10:12am UTC (2:12am pacific).
The image was created Feb 13th 03:32 UTC ("2020-02-12T19:32:59.061-08:00")
This could be why I could not reproduce the problem since I was using this new image (testing on current -gcp workers through try). (++humility for not noticing this sooner)
I'm testing next with the old image and checking metrics+logs for what images were used when.
Assignee | ||
Comment 71•5 years ago
|
||
(In reply to Connor Sheehan [:sheehan] from comment #69)
I've got no idea, to be honest. Mercurial does have issues with flaky networks, but usually this results in an error message/traceback indicating what went wrong. We've had someone on our team (Mitch) running hg clones with a purposely bad environment attempting to replicate some of the issues we've been seeing (see https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c101 for his first attempts to try and reproduce a network failure).
If we're always losing the log output during the clone while downloading a clone bundle, the first place to look for issues in upstream Mercurial would be this function. You can see this function is called directly between the status messages "applying clone bundle from XXX" and "finished applying clone bundle" here. It may be worth trying to run some clones with --profile and seeing if anything useful comes of that.
Thanks! I'll check these out. I think we're losing any logs or trace because we're force killing off the container and processes instead of sending a hangup/etc (My testing showed I lost any trace and trailing logs when I killed off a container running hg).
It's interesting that we've hit similar errors in AWS as well. Can you provide some URLs to tasks in Treeherder or Taskcluster so I can take a closer look? It's odd that we don't see this issue as much in AWS (this is my first time hearing of it). Maybe there's some interaction going on here, like hg has issues with shoddy connections, and GCS is less reliable for downloads than S3, so we see the issue show up much more frequently in GCP.
flakiness with GCS vs. S3 is my leading hypothesis for now :)
Here are treeherder links for some aws failures:
[taskcluster 2020-03-06 00:04:16.435Z] Task ID: WXkMwluDQomiGL3kfVUWPw
[taskcluster 2020-03-06 00:04:16.435Z] Worker ID: i-06d078ecaec590c56
[taskcluster 2020-03-06 00:04:16.435Z] Worker Group: aws
[taskcluster 2020-03-06 00:04:16.435Z] Worker Node Type: c5.4xlarge
[taskcluster 2020-03-06 00:04:16.435Z] Worker Type: b-linux
[taskcluster 2020-03-06 00:04:16.435Z] Public IP: 18.205.41.175
[taskcluster 2020-03-06 00:04:16.435Z] Hostname: ip-10-145-96-104
[taskcluster 2020-03-06 00:04:16.435Z] using cache "gecko-level-3-checkouts-sparse-v3-f730704b60167efec983" -> /builds/worker/checkouts
These look like the download was just slow (there is not a long gap between the last log line and the timeout kill), without the huge time gap we saw on GCP.
So it doesn't look as similar as I thought. The GCP logs also time out during the clone, but with a long gap, like:
[vcs 2020-02-13T03:12:52.225Z] ensuring https://hg.mozilla.org/integration/autoland@2f205d029f748eb1ac47ea1057511d1c1f4f7754 is available at /builds/worker/checkouts/gecko
[vcs 2020-02-13T03:12:52.879Z] (shared store has an active lock; assuming it is left over from a previous process and that the store is corrupt; deleting store and destination just to be sure)
[vcs 2020-02-13T03:13:01.474Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2020-02-13 04:08:01.001Z] === Task Finished ===
or
[vcs 2020-02-12T21:04:30.521Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/2a2b0b1b3be697dfbabe4dac69304c91878f28ef.stream-v2.hg
[vcs 2020-02-12T21:04:30.583Z] 562193 files to transfer, 3.19 GB of data
[vcs 2020-02-12T21:04:32.586Z]
[vcs 2020-02-12T21:04:33.585Z] clone [> ] 144006766/3427138758 46s
[...]
[vcs 2020-02-12T21:05:23.629Z] clone [=======================> ] 1870442547/3427138758 44s
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-02-12 22:34:25.861Z] === Task Finished ===
Assignee | ||
Comment 72•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #70)
other thread/topic, continuing to try to reproduce in gcp:
Reviewing the boot disk images, I see there is one updated on Feb12th.
The last timeout on GCP was Feb 12th at 10:12am UTC (2:12am pacific).
The image was created Feb 13th 03:32 UTC ("2020-02-12T19:32:59.061-08:00")
The image being used by a worker is configured in https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-images.yml
(for apply by ci-admin to worker-manager).
The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54
So the image at the failure time was:
monopacker-docker-worker-trusted-2020-02-04t16-30-23z:
fxci-level3-gcp:
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
from:
changeset: 618:c25b1fc0a6c8
user: Wander Lairson Costa <wander.lairson@gmail.com>
date: Wed Feb 05 23:52:11 2020 +0000
summary: Bug 1611255: Fix correct path for level 3 GCP image r=tomprince
diff --git a/worker-images.yml b/worker-images.yml
--- a/worker-images.yml
+++ b/worker-images.yml
@@ -120,7 +120,7 @@ monopacker-docker-worker-2020-02-03t19-2
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
monopacker-docker-worker-trusted-2020-02-04t16-30-23z:
fxci-level3-gcp:
- projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
+ projects/fxci-production-level3-workers/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
following
changeset: 616:e0c509239c21
user: Wander Lairson Costa <wander.lairson@gmail.com>
date: Tue Feb 04 18:56:47 2020 +0000
summary: Bug 1611255: Deploy trusted GCP builders r=Callek
diff --git a/worker-images.yml b/worker-images.yml
--- a/worker-images.yml
+++ b/worker-images.yml
@@ -10,7 +10,7 @@
docker-worker-hvm-builder-current: docker-worker-v201912041449-hvm-builder
docker-worker-hvm-builder-trusted-current: docker-worker-v201912041449-hvm-builder-trusted
monopacker-docker-worker-current: monopacker-docker-worker-2020-02-03t19-23-00z
-monopacker-docker-worker-trusted-current: monopacker-docker-worker-2020-02-03t19-23-00z
+monopacker-docker-worker-trusted-current: monopacker-docker-worker-trusted-2020-02-04t16-30-23z
occ-b-win2012-current: occ-b-win2012-pre
occ-b-win2012-trusted-current: occ-b-win2012-trusted-pre
@@ -118,6 +118,9 @@ monopacker-docker-worker-2020-02-03t19-2
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
fxci-levelt-gcp:
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
+monopacker-docker-worker-trusted-2020-02-04t16-30-23z:
+ fxci-level3-gcp:
+ projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
following
changeset: 611:532c0442bc8b
user: Wander Lairson Costa <wander.lairson@gmail.com>
date: Mon Feb 03 20:06:43 2020 +0000
summary: Bug 1611255: Upgrade GCP workers r=Callek
diff --git a/worker-images.yml b/worker-images.yml
--- a/worker-images.yml
+++ b/worker-images.yml
@@ -9,8 +9,8 @@
---
docker-worker-hvm-builder-current: docker-worker-v201912041449-hvm-builder
docker-worker-hvm-builder-trusted-current: docker-worker-v201912041449-hvm-builder-trusted
-monopacker-docker-worker-current: monopacker-docker-2019-11-18t20-32-23z
-monopacker-docker-worker-trusted-current: monopacker-docker-worker-trusted-2020-01-10t21-13-32z
+monopacker-docker-worker-current: monopacker-docker-worker-2020-02-03t19-23-00z
+monopacker-docker-worker-trusted-current: monopacker-docker-worker-2020-02-03t19-23-00z
occ-b-win2012-current: occ-b-win2012-pre
occ-b-win2012-trusted-current: occ-b-win2012-trusted-pre
@@ -111,6 +111,13 @@ monopacker-docker-worker-trusted-2020-01
fxci-level3-gcp:
# uses new cot key, i.e. tk/Sj...
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
+monopacker-docker-worker-2020-02-03t19-23-00z:
+ fxci-level1-gcp:
+ projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
+ fxci-level3-gcp:
+ projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
+ fxci-levelt-gcp:
+ projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
The bug was for updating the gcp image to decrease the default reclaim time: bug 1611255 "[docker-worker] High volume of tasks failing with claim-expired in GCP"
Assignee | ||
Comment 73•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #72)
The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54
The hg-clone timeouts on gcp happened in the window from 02-06 01:50:00 to 02-13 10:12:00.
Assignee | ||
Comment 74•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #73)
(In reply to Dave House [:dhouse] from comment #72)
The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54The hg-clone timeouts on gcp happened in the window from 02-06 01:50:00 to 02-13 10:12:00.
That was level3 images. The images for level1:
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z 2020-02-03 21:53:19
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-07t09-14-17z 2020-02-07 19:32:36 "Bug 1613800: Deploy new level 1 GCP workers; r=tomprince"
"" Wed, 12 Feb 2020 09:46:23 "Bug 1609568: Increase the memory of GCE instances r=jlorenzogg"
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-07t09-14-17z 2020-02-13 04:16:54 "Bug 1613800 - remove level 1 monopacker images, add and use replacement trusted gcp image r=aki"
The level1 instances are still using this disk image, docker-worker-gcp-googlecompute-2020-02-07t09-14-17z
The hg-clone timeouts on level1 happened in the window 02-12 02:02:00 to 02-13 09:58:00.
I looked for an audit log in taskcluster docs, and maybe the logs are where I need to check when ci-admin applied the changes to use the images.
Assignee | ||
Comment 75•5 years ago
|
||
I checked my level 1 failures. I had tagged them incorrectly; the ones I'd thought from treeherder were level 1 were actually tier 1, but they ran on the level 3 project instances.
I'm checking over all of my data and my assumptions and conclusions about it, but from what I found last night the failures were only on level 3:
2374 failures
only occurred in time window: 02-06 01:50:00 to 02-13 10:12:00
level3 images were (estimate based on ci-conf push timestamp, datetime listed before change):
worker_pool:gecko-3.b-linux-gcp
image: monopacker-docker-worker-trusted-current
2020-01-14 21:24:47
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
2020-02-03 21:53:19
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
2020-02-04 20:00:43
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
2020-02-05 23:52:21
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
2020-02-06 09:34:33
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
2020-02-13 04:16:54
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z
Assignee | ||
Comment 76•5 years ago
|
||
Looking at project audit log the images were created:
level3 last two images (the config ref'd image on 2020-02-05 23:52:21 did not exist):
2/12/20
8:33 PM
Completed: Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z
8:32 PM
Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z
1/10/20
2:20 PM
Completed: Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
2:20 PM
Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
The image in use during the failure time was from the taskcluster-imaging project (I'm reviewing that audit log now).
Assignee | ||
Comment 77•5 years ago
•
|
||
tc-imaging audit last actions for the problem image:
2/12/20
3:21 PM
Completed: Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-googlecompute-2019-11-04t19-24-29z
3:21 PM
Completed: Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
--
3:21 PM
Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-l3-googlecompute-2019-10-08t17-57-06z
3:21 PM
Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
--
Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-06t11-05-28z
2/4/20
9:41 AM
Completed: Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
Assignee | ||
Comment 78•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #77)
2/4/20
9:41 AM
Completed: Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
Miles, can I just check out that time period's commit (fc00dd508b) in monopacker, build the docker-worker image, and get the image into GCP?
Assignee | ||
Comment 79•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #73)
(In reply to Dave House [:dhouse] from comment #72)
The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54The hg-clone timeouts on gcp happened in the window from 02-06 01:50:00 to 02-13 10:12:00.
Correction: I thought grafana was showing me times in my timezone, but checking on the actual first and last failures I found the UTC datetimes match the times listed in my queries:
first: 2020-02-05T18:50:58.263Z
last: 2020-02-13T03:13:01.474Z
Assignee | ||
Comment 80•5 years ago
|
||
The non-documentation changes in monopacker between the failing image's time (commit fc00dd508b) and the image I've tested (built feb 12, so likely commit debfec96) are:
diff --git a/builders/docker_worker_gcp_l3.yaml b/builders/docker_worker_gcp_community.yaml
similarity index 88%
rename from builders/docker_worker_gcp_l3.yaml
rename to builders/docker_worker_gcp_community.yaml
index b53ff99..2fc336f 100644
--- a/builders/docker_worker_gcp_l3.yaml
+++ b/builders/docker_worker_gcp_community.yaml
@@ -1,4 +1,3 @@
-# copy of docker_worker_gcp
template: googlecompute
platform: linux
diff --git a/builders/docker_worker_gcp_trusted.yaml b/builders/docker_worker_gcp_trusted.yaml
new file mode 100644
index 0000000..aa98ef0
--- /dev/null
+++ b/builders/docker_worker_gcp_trusted.yaml
@@ -0,0 +1,18 @@
+# copy of docker_worker_gcp in l3 project
+template: googlecompute
+platform: linux
+
+builder_var_files:
+ - default_linux
+ - default_gcp
+ - googlecompute_bionic
+
+script_directories:
+ - ubuntu-bionic
+ - worker-runner-linux
+ - docker-worker-linux
+
+# Note: this project disallows port 22, so baking images requires
+# temporarily allowing access
+builder_vars:
+ project_id: fxci-production-level3-workers
diff --git a/template/builders/googlecompute.jinja2 b/template/builders/googlecompute.jinja2
index 04b49e5..256953d 100644
--- a/template/builders/googlecompute.jinja2
+++ b/template/builders/googlecompute.jinja2
@@ -5,3 +5,7 @@
ssh_username: {{builder.vars.ssh_username}}
project_id: {{builder.vars.project_id}}
zone: {{builder.vars.zone}}
+ {# enable nested virtualization. It allows to run the x86 android emulator
+ inside the VM with processor virtualization capabilities enabled #}
+ image_licenses:
+ - projects/vm-options/global/licenses/enable-vmx
Neither the change of project nor the enabling of nested virtualization seems like it would affect anything related to the hg clone.
Assignee | ||
Comment 81•5 years ago
|
||
It may be that the docker-worker version used was different.
The docker-worker versions around that time were:
docker_worker_version="v202002031613"
v202002031039 (not committed as a change in monopacker)
docker_worker_version="v202001201231"
docker_worker_version="v201912041449"
I built boot disk images using these four versions. I ran hg pulls with these four for the last ~6h without any hangs. I'm looking at the docker images now to see if maybe the container being used at that time had something different.
Assignee | ||
Comment 82•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #78)
(In reply to Dave House [:dhouse] from comment #77)
2/4/20
9:41 AM
Completed: Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-04t16-30-23zMiles, can I just set to that time-period commit (fc00dd508b) in monopacker and build the docker-worker and get the image in gcp?
Thx Miles (we met and reviewed the monopacker build and the problem image)
Assignee | ||
Comment 83•5 years ago
|
||
Miles suggested the docker version or packages may have changed.
We have the image created Feb13, which has been in use and on which I could not reproduce the problem. On that, we have docker 19.03.6, build 369ce74a3c.
Docker 19.03.6 was released Feb12: https://docs.docker.com/engine/release-notes/#19036
Previous version 19.03.5 was released Nov14: https://docs.docker.com/engine/release-notes/#19035
containerd.io/now 1.2.10-3 amd64 [installed,local]
docker-ce/now 5:19.03.6~3-0~ubuntu-bionic amd64 [installed,local]
docker-ce-cli/now 5:19.03.6~3-0~ubuntu-bionic amd64 [installed,local]
So the Feb04 image would have had the older docker version. I'll try this previous version.
Assignee | ||
Comment 84•5 years ago
|
||
I've tested with docker 5:19.03.5 and a container from a feb-11 failed task (Mp3PpbPFTKW-rkdiT2x1qA https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288443741&repo=autoland&lineNumber=32) .
No slow/failed/hung clones yet across the three docker-worker versions from around the failure time (the few zeros are from when I started the tests incorrectly).
Images for test:
docker-worker-gcp-googlecompute-v202002031613-docker19-03-5 us 1.27 GB 10 GB taskcluster-imaging Mar 24, 2020, 2:45:27 PM
docker-worker-gcp-googlecompute-v202002031039-docker19-03-5 us 1.27 GB 10 GB taskcluster-imaging Mar 24, 2020, 2:35:14 PM
docker-worker-gcp-googlecompute-v202001201231-docker19-03-5
Assignee | ||
Comment 85•5 years ago
|
||
GCP support concluded that they could not find more information about the deleted disk image or the instances that hit the hg timeout/failure, and did not find anything that looked unusual or that might explain the problem. We closed the support case.
Assignee | ||
Comment 86•5 years ago
|
||
I completed more tests on the docker version from the failure time and have not found anything.
I am looking next at pushing a full set of builds through gcp workers to test this. The per-instance cost is estimated in the gcp console as:
That's about $0.319 hourly
Pay for what you use: No upfront costs and per second billing
Item Estimated costs
32 vCPUs + 72 GB memory $202.34/month
10 GB standard persistent disk $0.40/month
375 GB local SSD scratch disk $30.00/month
Sustained use discount
- $0.00/month
Total $232.74/month
Assignee | ||
Comment 87•5 years ago
|
||
During a regular week, our b-linux numbers peak around 750:
https://earthangel-b40313e5.influxcloud.net/d/MZihUsuWz/gcp-aws-builder-counts?orgId=1&from=now-7d&to=now
The maximum instance count for AWS builders so far this week was 734, with 110 on GCP. Last week it was 832 and 103. Six weeks ago, Feb 16-22, it was 769. The maximum over the last 90 days was just under 1100 simultaneous builders.
Assignee | ||
Comment 88•5 years ago
|
||
On April 22-23, I ran a production-size load by copying production tasks, and there were no timeout problems (the failures were similar to what was on AWS at the time):
2402 total tasks ran:
2212 completed
190 failed (133 connection resets on hg clone/pull (similar to the timeouts seen in February, but caught immediately -- 70 of these were also aborted prematurely on AWS), 2 worker containers timed out, 5 screen session failures, 32 TBPL failures, 8 build failures, 6 with no logs)
Of the failures on GCP,
47 also failed on AWS
70 were aborted prematurely on AWS
So we saw 3% (73 of 2402 tasks) fail on GCP that did not fail on AWS, all from a connection reset during the hg pull/clone.
Assignee | ||
Comment 89•5 years ago
|
||
Assignee | ||
Comment 90•5 years ago
|
||
Connor, do you know if we see more connection resets for bundle downloads in GCP than in AWS? I attached the 10-line tail for each of the 133 failures with hg connection resets that I saw in this set from April 23-24 (I'll collect more recent ones also, and look for the same failures after updating the worker image).
It looks like they were all against the bundle:
"applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/559b28bce1555d02dc1e19d63e4a64e4d54130ed.stream-v2.hg"
Assignee | ||
Comment 91•5 years ago
|
||
I'm wondering whether, although Google support did not find problems, we are exceeding the limits on the single-site hg bundles (in GCP's us-central1), re: bug 1587958. I did not count the number of simultaneous downloads at the failure times.
Assignee | ||
Comment 92•5 years ago
|
||
It looks like the connection resets on bundles are clustered by the time they occur.
I copied 100 tasks yesterday (from gecko-3/b-linux to gecko-3/b-linux-gcp) and 7 hit connection resets on downloading the bundle. 2 of those 7 saw a worker-shutdown interruption on aws.
All seven failures happened within about 20 seconds (the one in-progress download is logged as failed 30s later):
[vcs 2020-05-20T21:44:53.048Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:53.114Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:53.783Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:45:07.208Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:45:07.496Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:45:08.191Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:44:46.501Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:46.558Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:47.043Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:45:37.770Z] clone [=========================> ] 2072127653/3582756450 40s
[vcs 2020-05-20T21:45:37.770Z]
[vcs 2020-05-20T21:45:37.770Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:45:04.744Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:45:04.806Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:45:04.973Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:44:50.632Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:50.728Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:51.318Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:44:50.474Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:50.533Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:50.711Z] abort: Connection reset by peer
Comment 93•5 years ago
|
||
(In reply to Dave House [:dhouse] from comment #91)
I'm wondering if, although google support did not find problems, we are exceeding the limits on the single-site hg bundles (in gcp's us-central1)
So I just took a look at the GCS quotas and limits page and as far as I can tell, there is no limit on the number of concurrent reads for an object: "There is no limit to reads of an object. Buckets initially support roughly 5000 object reads per second and then scale as needed."
It's not a surprise that we see all connections happening against the same bundle. Bundles are generated nightly and come in multiple formats, one of which is the "stream" bundle, which is the fastest to download on high-speed networks. So we would expect the same GCS-hosted stream bundle to be downloaded by most clients in GCP-CI on a given day.
To answer your question more directly: I'm not sure if we see more connection drops on GCP than AWS.
Assignee | ||
Comment 94•5 years ago
|
||
(In reply to Connor Sheehan [:sheehan] from comment #93)
(In reply to Dave House [:dhouse] from comment #91)
I'm wondering if, although google support did not find problems, we are exceeding the limits on the single-site hg bundles (in gcp's us-central1)
So I just took a look at the GCS quotas and limits page and as far as I can tell, there is no limit on the number of concurrent reads for an object: "There is no limit to reads of an object. Buckets initially support roughly 5000 object reads per second and then scale as needed."
It's not a surprise that we see all connections happening against the same bundle. Bundles are generated nightly and come in multiple formats, one of which is the "stream" bundle which if the fastest to download on high-speed networks. So we would expect the same GCS-hosted stream bundle to be downloaded by most clients in GCP-CI on a given day.
To answer your question more directly: I'm not sure if we see more connection drops on GCP than AWS.
Thanks!
Re: hg issues with shoddy connections: is there a retry adjustment we could make on the clone to work around a 30-60 second downtime for the bundles in GCP? I'm assuming that the grouped failures I saw last night are representative of the issue. If there isn't a config for hg, then I may look at just retrying the clone step of the task on a connection failure.
Comment 95•5 years ago
|
||
I don't think there's a config for hg that specifically targets the clonebundle download step.
One of the things robustcheckout does is add retry logic around various network operations. We can patch it to retry when performing the bundle download (either in this bug or a new one). Having TC retry when a bundle download fails could work as well.
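As a rough illustration of that retry idea (this is not robustcheckout's actual code, which lives in version-control-tools; the function and exception names here are made up), the wrapper would look something like:
# Sketch: retry a flaky clone-bundle download with backoff. Illustrative only;
# the real retry handling in robustcheckout is different.
import subprocess
import time

class BundleDownloadError(Exception):
    pass

def download_bundle(url: str, dest: str) -> None:
    # Stand-in for the real download/apply step; a connection reset surfaces
    # here as a non-zero curl exit code.
    result = subprocess.run(["curl", "--fail", "--silent", "-o", dest, url])
    if result.returncode != 0:
        raise BundleDownloadError(f"download failed (exit {result.returncode})")

def download_with_retries(url: str, dest: str, attempts: int = 4, delay: float = 15.0) -> None:
    for attempt in range(1, attempts + 1):
        try:
            download_bundle(url, dest)
            return
        except BundleDownloadError as exc:
            if attempt == attempts:
                raise
            # 15/30/60s backoff comfortably covers the ~30-60s GCS blips seen above
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2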
Comment 96•4 years ago
|
||
Are we still debugging this? If not, it may be time to shut these off entirely for now.
Assignee | ||
Comment 97•4 years ago
|
||
If switching to GCP becomes something we need to look at again this year, then I think we need to debug the clonebundle timeout.
I would like to keep the mirror builds going in GCP infrequently so that we know we could switch over. However, I could be wrong that that provides any value.
What do you think? Or, who is deciding when the builds move to GCP?
Comment 98•4 years ago
|
||
It's worth noting that the GCP builds are still slower than those on AWS, both because the machines are slower despite having more cores, and because the backing storage (equivalent to S3) for sccache is slower too.
Comment 99•4 years ago
|
||
(In reply to Dave House [:dhouse] from comment #97)
If switching to gcp becomes something we need to look at again this year, then I think we need to debug the clonebundle timeout.
I would like to keep the mirror builds going in gcp in-frequently so that we know we could switch over. However, I could be wrong that that provides any value.
What do you think? Or, who is deciding when the builds move to gcp?
After digging into cost ($200-$300/month), I'm fine to leave them on. I don't know who is making the decision on whether or not to move, though.
Comment 100•2 years ago
|
||
Builds have been running in GCP for a while now, and shippable builds just switched over recently. I think we should close this and open a new bug if we want to prioritize improving performance again.