Closed Bug 1614852 Opened 4 years ago Closed 2 years ago

Frequent build timeouts when (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)

Categories

(Firefox Build System :: Task Configuration, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: CosminS, Assigned: dhouse)

References

Details

(Whiteboard: [stockwell infra])

Attachments

(4 files)

Flags: needinfo?(sheehan)
Flags: needinfo?(mozilla)

This is blocking reopening autoland because the failure rate is still high and we can't get test coverage this way.

Severity: critical → blocker

We've rerun most of the busted builds and the majority of them were green on the first try. There are still failures, but they don't seem as frequent, so we've reopened autoland.

Severity: blocker → normal

This is also blocking beta builds due to frequent failures. Can we get this fixed or the GCP switch reverted today?

Flags: needinfo?(wcosta)
Flags: needinfo?(coop)

Looking at some of the logs, it seems there are two problems.

  1. the download speed for the bundles is much lower than usual. In https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288433462&repo=autoland&lineNumber=102 it's only 30 MiB/s. Testing on my own instance in GCE, I was able to hit 46.7 MB/sec and download the repo bundle in 70s; usual numbers are around 55s for the download (see the sketch after this list).

  2. the output logs seem to go blank after a few seconds, showing no output from hg, and then fail with a timeout.
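
(For reference, a rough way to re-check the bundle download throughput from a GCE instance -- a sketch, not the exact measurement I ran; the bundle URL is the one that appears in the failing task logs later in this bug and changes as new bundles are generated:)

$ curl -sS -o /dev/null -w 'downloaded %{size_download} bytes at %{speed_download} bytes/s\n' \
    https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/b8ad0267e88abff0a54a6f0f1b5de2ad3a018460.stream-v2.hg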

I wonder if there is some sort of disk I/O issue that could be causing both the slow timeouts and the lack of logging output. Can someone take a look at the disk I/O graphs (probably in StackDriver?) and see if anything looks awry for the new build workers compared to the old ones? If it's possible to give me access to those dashboards to debug that would work too.

Flags: needinfo?(sheehan)

I wonder if there is some sort of disk I/O issue that could be causing both the slow timeouts and the lack of logging output. Can someone take a look at the disk I/O graphs (probably in StackDriver?) and see if anything looks awry for the new build workers compared to the old ones? If it's possible to give me access to those dashboards to debug that would work too.

Hrm, I don't know who has access to Stackdriver in production; 303 :fubar

Flags: needinfo?(wcosta) → needinfo?(klibby)

I think I've given Connor user access to metrics in the fxci-production-level{1,3}-workers projects. Looking at the graphs, it does look pretty spiky and there are some definite outliers, but generally looks ok?

Dave's going to help y'all debug and get more access as needed.

Flags: needinfo?(klibby)

We discussed the problem in #vcs on chat.m.o and decided to roll back, because the cause is not clear and appears to be limited to the worker or project configuration.

Wander requested ssh access, and we both found the ssh connections to the instances (tested with level1) fail.
With a confirmed key added (new, verified through the serial port 1/console), I still get a failed ssh connection. I'm not sure if there is a VPN or firewall rules set up (looking).

(In reply to Dave House [:dhouse] from comment #9)

Wander requested ssh access, and we both found the ssh connections to the instances (tested with level1) fail.
With a confirmed key added (new, verified through the serial port 1/console), I still get a failed ssh connection. I'm not sure if there is a VPN or firewall rules set up (looking).

The instances are configured with external IPs, and the firewall rules allow icmp and livelog from external.
I've added an exception for ssh for myself.

Wander, to ssh in, please add or edit an "ssh-external" firewall rule to allow your IP to ssh in, like:
Name: ssh-external
Direction: Ingress
Targets: Apply to all
Source IP ranges: yourip/32
Protocols/ports: tcp:22
Action: Allow
Priority: 1000
Network: default
Then ssh with gcloud or ssh directly (the web SSH from GCP's console requires adding Google's IP ranges to the rule or opening it up wide).
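
(For reference, a sketch of the equivalent gcloud command; the field values mirror the rule above, the project is the level1 workers project mentioned elsewhere in this bug, and you would substitute your own IP range:)

$ gcloud compute firewall-rules create ssh-external \
    --project=fxci-production-level1-workers \
    --network=default --direction=INGRESS --action=ALLOW \
    --rules=tcp:22 --source-ranges=YOUR_IP/32 --priority=1000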

Flags: needinfo?(wcosta)

Hi, are there any updates on the rollback? Is there an ETA?

Flags: needinfo?(dhouse)
Blocks: 1547111

This patch essentially backs out 5b47f5e53d2a, but keeps the gecko-{1,3}/b-linux-aws pools
until we merge everywhere.

  • renames gecko-{1,3}/b-linux to gecko-{1,3}/b-linux-gcp, and reduces their max capacity.
  • copies the gecko-{1,3}/b-linux-aws configs to gecko-{1,3}/b-linux and ups their max capacity.
Assignee: nobody → aki
Status: NEW → ASSIGNED

I've applied the patch via ci-admin, so we should be switching over to aws builds shortly.

I decided that leaving this unlanded was probably playing with fire, so I landed it with r=bustage: https://hg.mozilla.org/ci/ci-configuration/rev/153c4ba1772cec9c2dc3c8e13e06f85d0cc57da1. Still open to review or follow-up as needed.

Getting a lot of errors like

Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z'. The referenced image resource cannot be found.

Not sure what to do about this, other than leave it for Wander's/Miles' morning?

FYI, the changes from bug 1547111 got backed out here:

https://hg.mozilla.org/integration/autoland/rev/7dcbf3debe6a3a527df82c422446ca42a3de78c4
https://hg.mozilla.org/mozilla-central/rev/5fa1a31cb9547a0400f0955707e46b3046680123

We're no longer hitting this issue after the rollback, except on existing gcp gecko-{1,3}/b-linux workers that are timing out running tasks.

Flags: needinfo?(dhouse)

(In reply to Dave House [:dhouse] from comment #10)

... (the web ssh from gcp's console requires adding google's ip ranges to the rule or opening it up wide).

Wander, I added a rule to allow the gcp console browser ssh connections in l1. So you can use that if you need to attach and test/debug. I'll remove it when you're done or in a few days.

(In reply to Aki Sasaki [:aki] (he/him) (UTC-7) from comment #15)

Getting a lot of errors like

Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z'. The referenced image resource cannot be found.

Not sure what to do about this, other than leave it for Wander's/Miles' morning?

https://hg.mozilla.org/ci/ci-configuration/rev/c4be240dd8f6c83fe3f548ab946096df223fc56a should fix this issue.
dhouse is killing old gcp gecko-{1,3}/b-linux workers.

No timeouts since workers got killed.

(In reply to Dave House [:dhouse] from comment #10)

(In reply to Dave House [:dhouse] from comment #9)

Wander requested ssh access, and we both found the ssh connections to the instances (tested with level1) fail.
With a confirmed key added (new, verified through the serial port 1/console), I still get a failed ssh connection. I'm not sure if there is a VPN or firewall rules set up (looking).

The instances are configured with external IPs, and the firewall rules allow icmp and livelog from external.
I've added an exception for ssh for myself.

Wander, to ssh in, please add or edit an "ssh-external" firewall rule to allow your IP to ssh in, like:
Name: ssh-external
Direction: Ingress
Targets: Apply to all
Source IP ranges: yourip/32
Protocols/ports: tcp:22
Action: Allow
Priority: 1000
Network: default
Then ssh with gcloud or ssh directly (the web SSH from GCP's console requires adding Google's IP ranges to the rule or opening it up wide).

I still have no permission:

$ gcloud beta compute --project "fxci-production-level1-workers" ssh --zone "us-central1-c" "gecko-1-b-linux-gcp-a-zmqnybq2gwr7tzzputtg"
WARNING: The public SSH key file for gcloud does not exist.
WARNING: The private SSH key file for gcloud does not exist.
WARNING: You do not have an SSH key for gcloud.
WARNING: SSH keygen will be executed to generate a key.
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/vagrant/.ssh/google_compute_engine.
Your public key has been saved in /home/vagrant/.ssh/google_compute_engine.pub.
The key fingerprint is:
SHA256:47xxSavHa81GdWP6fcsA2X9hVc5Ohk3iBl7B6WY627c vagrant@vagrant
The key's randomart image is:
+---[RSA 3072]----+
|            ..+oo|
|           . +oB.|
|            ..+ B|
|            o.=*o|
|        S .o *o+o|
|       o o o=.o .|
|        +.++ =..o|
|         =+ = +.=|
|        oo.o   E+|
+----[SHA256]-----+
Updating project ssh metadata...failed.
Updating instance ssh metadata...failed.
ERROR: (gcloud.beta.compute.ssh) Could not add SSH key to instance metadata:
 - The user does not have access to service account 'taskcluster-worker@fxci-production-level1-workers.iam.gserviceaccount.com'.  User: 'wcosta@mozilla.com'.  Ask a project owner to grant you the iam.serviceAccountUser role on the service account
Flags: needinfo?(wcosta) → needinfo?(dhouse)

(In reply to Wander Lairson Costa from comment #21)

Please try again now; I re-added your access to level1. Your access was removed -- I'll need to check whether there is automation that removed it or if someone did so manually.

Flags: needinfo?(dhouse) → needinfo?(wcosta)

(In reply to Dave House [:dhouse] from comment #22)

(In reply to Wander Lairson Costa from comment #21)

Please try again now; I re-added your access to level1. Your access was removed -- I'll need to check whether there is automation that removed it or if someone did so manually.

We talked in Matrix and I can log in to the machines now. :aki, we are having trouble reproducing the problem since the rollback; could you please redeploy ci-configuration in staging?

Flags: needinfo?(wcosta) → needinfo?(aki)

I'm reducing the severity on this bug since it is no longer a "blocker" for ci.

Severity: blocker → normal
Flags: needinfo?(coop)

Adding some of the discussion yesterday from debugging the problem in #vcs in chat.m.o:

dhouse
sheehan: what do you think at this point? With your test on a separate gcp instance showing good performance, can we determine that the problem is with the instance image or configuration and not the pipe to hg.m.o (or cache/other)?
sheehan
@dh
dhouse:
ugh. trying to select someone's name to ping them requires the mouse apparently. sigh.
I don't know at this point. the tasks are all failing to download a file from a GCS bucket in the same region when they hit these slowness/no logs problems. I doubt it's an issue with instances downloading files from within google
dhouse
walac: since these are new, can you switch the tasks back to all on aws until this problem is resolved? (and this can be worked through in staging?)
sheehan
dhouse: fwiw we previously had a subset of builds running in GCP, and they didn't have this issue AFAIK. is there any difference between the workers from that subset and the new workers?

(In reply to Wander Lairson Costa from comment #23)

We talked in Matrix and I can log in to the machines now. :aki, we are having trouble reproducing the problem since the rollback; could you please redeploy ci-configuration in staging?

Done.

Flags: needinfo?(aki)
Whiteboard: [stockwell disable-recommended] → stockwell infra
Whiteboard: stockwell infra → [stockwell infra]

I dealt with switching us back, but we probably need a different owner to fix the underlying issue.

Assignee: aki → nobody
Status: ASSIGNED → NEW

Wander, do you have more information on the problem? (Maybe notes on what may need to be done, or links to the PRs and bugs?) I don't have a complete view of what happened before I joined to check the disk-I/O/network issues for this bug. Or, if this is being worked on in another bug, could you close this one out and link to it?

Flags: needinfo?(wcosta)

(In reply to Dave House [:dhouse] from comment #29)

Wander, do you have more information on the problem? (Maybe notes on what may need to be done, or links to the PRs and bugs?) I don't have a complete view of what happened before I joined to check the disk-I/O/network issues for this bug. Or, if this is being worked on in another bug, could you close this one out and link to it?

afaik, the tracking bug for the vcs issue is this one.

It is worth mentioning the patches that switched from aws to gcp:
ci-configuration: https://phabricator.services.mozilla.com/D60042
mozilla-central: https://phabricator.services.mozilla.com/D60042

Flags: needinfo?(wcosta)
Assignee: nobody → dhouse

Connor, are there pieces that need to be set up for caching/bundles for workers in gcp? (bugs that this bug can depend on) Or is that ready and do we just need to test this (in staging) at scale again?

Flags: needinfo?(sheehan)

The caching/bundles are already set up; the only assumption is that all the tasks are taking place in us-central1.

Flags: needinfo?(sheehan)

Hi Wander, did you get a chance to test this in staging? Or is there some setup that needs to be done (that I can help with) to stage builds in gcp?

Flags: needinfo?(wcosta)

(In reply to Dave House [:dhouse] from comment #33)

Hi Wander, did you get a chance to test this in staging? Or is there some setup that needs to be done (that I can help with) to stage builds in gcp?

If we can get the builds running in staging, maybe we can try to reproduce the timeouts there with scale matching what we had in production.

In case we cannot reproduce it, do you know how we can configure the "shadow builds" (I'm not sure what they're called -- since the aws builds are prod/release, and there are shadow/unused builds running on gcp?) to run with the new image?

(In reply to Dave House [:dhouse] from comment #34)

If we can get the builds running in staging, maybe we can try to reproduce the timeouts there with scale matching what we had in production.

In case we cannot reproduce it, do you know how we can configure the "shadow builds" (I'm not sure what they're called -- since the aws builds are prod/release, and there are shadow/unused builds running on gcp?) to run with the new image?

Kendall: what's the expectation on your side? Should Wander be running further tests?

The larger question is whether staging performance is an adequate proxy for production. We weren't hitting these timeouts with only tier 3 load, and I don't know if we are equipped to simulate a full production workload in staging. We should figure that out if we want staging results to matter.

Flags: needinfo?(klibby)

Wander, I will do the testing and I may have questions for you. Here are my notes so far:

At the failure time, configs were at:
https://hg.mozilla.org/ci/ci-configuration/file/b44ac91dc9706738c4c9d01bcf3d530651e6fb5a/worker-pools.yml#l103
https://hg.mozilla.org/mozilla-central/file/aa01bf29eae7993e8bfb147c16a789a16bc45617/taskcluster/ci/build

Current state looking at the worker+pool configurations at https://hg.mozilla.org/ci/ci-configuration/file/c5027a4f096d2da89bff62e27e64dd4bff4258c1/worker-pools.yml#l86 (current; after https://phabricator.services.mozilla.com/D62672) and the task to pool mappings at https://hg.mozilla.org/mozilla-central/file/b7a8b4961e01b8a38be141bb89cc93f1ad404e4e/taskcluster/ci/build (after https://phabricator.services.mozilla.com/rMOZILLACENTRAL7dcbf3debe6a3a527df82c422446ca42a3de78c4):
The instance image, instance type, and pools are still in gcp, but tasks are not routed to them.
The tier 3 validation builds (the "shadow" builds) are in gcp; tier 1 builds are in aws.

The change flipping aws with gcp was:
worker+pool config: ci-configuration:/worker-pools.yml (names used in m-c ymls like "workerType: b-linux-aws" s/-gcp//g, "provider_id: aws")
task to pool mapping: m-c:/taskcluster/ci/build/{android,linux,macosx}.yml (s/gcp/aws/g)
The "revert" kept the pools in gcp (flipped the tier1/3 task mapping in m-c and restored the aws pools in ci-configuration).

I need to collect data on where the failures happened (worker type, confirming pools and instance type/specs) and on the infra config (cache, region, ?) during the failures and now. Then I can look at reproducing with the "prod" but turned-off gcp pools as "staging", since they match (they are) production.

(In reply to Dave House [:dhouse] from comment #36)

Wander, I will do the testing and I may have questions for you. Here are my notes so far:

At the failure time, configs were at:
https://hg.mozilla.org/ci/ci-configuration/file/b44ac91dc9706738c4c9d01bcf3d530651e6fb5a/worker-pools.yml#l103
https://hg.mozilla.org/mozilla-central/file/aa01bf29eae7993e8bfb147c16a789a16bc45617/taskcluster/ci/build

Current state looking at the worker+pool configurations at https://hg.mozilla.org/ci/ci-configuration/file/c5027a4f096d2da89bff62e27e64dd4bff4258c1/worker-pools.yml#l86 (current; after https://phabricator.services.mozilla.com/D62672) and the task to pool mappings at https://hg.mozilla.org/mozilla-central/file/b7a8b4961e01b8a38be141bb89cc93f1ad404e4e/taskcluster/ci/build (after https://phabricator.services.mozilla.com/rMOZILLACENTRAL7dcbf3debe6a3a527df82c422446ca42a3de78c4):
The instance image, instance type, and pools are still in gcp, but tasks are not routed to them.
The tier 3 validation builds (the "shadow" builds) are in gcp; tier 1 builds are in aws.

The change flipping aws with gcp was:
worker+pool config: ci-configuration:/worker-pools.yml (names used in m-c ymls like "workerType: b-linux-aws" s/-gcp//g, "provider_id: aws")
task to pool mapping: m-c:/taskcluster/ci/build/{android,linux,macosx}.yml (s/gcp/aws/g)
The "revert" kept the pools in gcp (flipped the tier1/3 task mapping in m-c and restored the aws pools in ci-configuration).

I need to collect data on where the failures happened (worker type, confirming pools and instance type/specs) and on the infra config (cache, region, ?) during the failures and now. Then I can look at reproducing with the "prod" but turned-off gcp pools as "staging", since they match (they are) production.

Your description is pretty accurate; I believe the change would also include deleting the current b-linux worker pool in ci-configuration, as it is configured for AWS. 302 to :tomprince to see if he sees something wrong.

Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa from comment #37)

Your description is pretty accurate; I believe the change would also include deleting the current b-linux worker pool in ci-configuration, as it is configured for AWS. 302 to :tomprince to see if he sees something wrong.

Thank you for confirming!

Reviewing the failures from comment 0: it looks like they stopped logging mid-pull and then, long after, the task timed out. Possibly the log was cut off, or something hung.

Two pulled from cached bundles (https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified), and the other did not reach the pull step (I expect it did, or would have, used the bundle, but I think the log is interrupted).

build-linux64-tsan-fuzzing/opt
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288433462&repo=autoland&lineNumber=102

[taskcluster 2020-02-12 00:03:08.515Z] Worker Group: fxci-level3-gcp
[taskcluster 2020-02-12 00:03:08.515Z] Worker Node Type: projects/970387039909/machineTypes/custom-32-36864
[taskcluster 2020-02-12 00:03:08.515Z] Worker Type: b-linux
[...]
[vcs 2020-02-12T00:03:38.588Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2020-02-12T00:03:38.690Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2020-02-12T00:03:38.691Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2020-02-12T00:03:38.770Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', 'b4dc366c916164498a62980e867c0cf5647e6d41', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2020-02-12T00:03:38.865Z] (using Mercurial 4.8.1)
[vcs 2020-02-12T00:03:38.865Z] ensuring https://hg.mozilla.org/integration/autoland@b4dc366c916164498a62980e867c0cf5647e6d41 is available at /builds/worker/workspace/build/src
[vcs 2020-02-12T00:03:39.446Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-02-12T00:03:39.780Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-02-12T00:03:40.225Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/b8ad0267e88abff0a54a6f0f1b5de2ad3a018460.stream-v2.hg
[vcs 2020-02-12T00:03:40.296Z] 561909 files to transfer, 3.19 GB of data
[...]
[vcs 2020-02-12T00:04:46.164Z] clone [=====================>                     ] 1787855560/3426007566 1m00s
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-02-12 01:33:36.083Z] === Task Finished ===

re-ran pull time (local cache, "sharing from existing pooled repository 8ba995b..."):
02:49:03 to 02:49:33

artifact-build-linux64-artifact/opt
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288437926&repo=autoland&lineNumber=36

[taskcluster 2020-02-11 23:59:28.653Z] Worker Group: fxci-level3-gcp
[taskcluster 2020-02-11 23:59:28.653Z] Worker Node Type: projects/970387039909/machineTypes/custom-32-36864
[taskcluster 2020-02-11 23:59:28.653Z] Worker Type: b-linux
[...]
[vcs 2020-02-12T00:35:23.592Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2020-02-12T00:35:23.705Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2020-02-12T00:35:23.705Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2020-02-12T00:35:23.793Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', 'b4dc366c916164498a62980e867c0cf5647e6d41', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2020-02-12T00:35:23.880Z] (using Mercurial 4.8.1)
[vcs 2020-02-12T00:35:23.880Z] ensuring https://hg.mozilla.org/integration/autoland@b4dc366c916164498a62980e867c0cf5647e6d41 is available at /builds/worker/workspace/build/src
[vcs 2020-02-12T00:35:24.547Z] (shared store has an active lock; assuming it is left over from a previous process and that the store is corrupt; deleting store and destination just to be sure)
[vcs 2020-02-12T00:35:33.303Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2020-02-12 01:35:21.169Z] === Task Finished ===

re-ran pull time:
[vcs 2020-02-12T03:53:35.560Z] transferred 3.19 GB in 98.0 seconds (33.3 MB/sec)

build-macosx64/debug
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288432758&repo=autoland&lineNumber=41

[taskcluster 2020-02-11 23:59:28.653Z] Worker Group: fxci-level3-gcp
[taskcluster 2020-02-11 23:59:28.653Z] Worker Node Type: projects/970387039909/machineTypes/custom-32-36864
[taskcluster 2020-02-11 23:59:28.653Z] Worker Type: b-linux
[...]
[vcs 2020-02-11T23:59:53.801Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2020-02-11T23:59:53.925Z] region google/us-central1 not yet supported; using public hg.mozilla.org service
[vcs 2020-02-11T23:59:53.925Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2020-02-11T23:59:54.004Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '02f22447369053837e8bc80c436224a8eb62bf16', 'https://hg.mozilla.org/integration/autoland', '/builds/worker/workspace/build/src']
[vcs 2020-02-11T23:59:54.100Z] (using Mercurial 4.8.1)
[vcs 2020-02-11T23:59:54.101Z] ensuring https://hg.mozilla.org/integration/autoland@02f22447369053837e8bc80c436224a8eb62bf16 is available at /builds/worker/workspace/build/src
[vcs 2020-02-11T23:59:54.557Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-02-11T23:59:54.709Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-02-11T23:59:55.099Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/b8ad0267e88abff0a54a6f0f1b5de2ad3a018460.stream-v2.hg
[vcs 2020-02-11T23:59:55.270Z] 561909 files to transfer, 3.19 GB of data
[vcs 2020-02-11T23:59:57.355Z] 
[vcs 2020-02-11T23:59:58.355Z] clone [                                           ]   56359221/3426007566 2m05s
[vcs 2020-02-11T23:59:59.394Z] clone [>                                          ]   87628124/3426007566 1m58s
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-02-12 01:29:51.256Z] === Task Finished ===

not re-run

(In reply to Dave House [:dhouse] from comment #39)

Reviewing the failures from comment 0: it looks like they stopped logging mid-pull and then, long after, the task timed out. Possibly the log was cut off, or something hung.

this reminded me of something glob mentioned: if you pull a disk out from under hg while it's trying to update/clone, it'll simply hang for a long time with no output, not unlike what happens when an NFS mount goes away. so their thought was that the issue may be due to underlying storage problems/config/etc.

(In reply to Chris Cooper [:coop] pronoun: he from comment #35)

Kendall: what's the expectation on your side? Should Wander be running further tests?

Nah, Dave's driving this. If Wander can be available to answer questions, then I think we're good.

The larger question is whether staging performance is an adequate proxy for production. We weren't hitting these timeouts with only tier 3 load, and I don't know if we are equipped to simulate a full production workload in staging. We should figure that out if we want staging results to matter.

Agreed. This will not be fun if it only appears under full load.

Flags: needinfo?(klibby)

(In reply to Kendall Libby [:fubar] (he/him) from comment #40)

(In reply to Dave House [:dhouse] from comment #39)

Reviewing the failures from comment 0: it looks like they stopped logging mid-pull and then, long after, the task timed out. Possibly the log was cut off, or something hung.

this reminded me of something glob mentioned: if you pull a disk out from under hg while it's trying to update/clone, it'll simply hang for a long time with no output, not unlike what happens when an NFS mount goes away. so their thought was that the issue may be due to underlying storage problems/config/etc.

Thank you! I'll keep this in mind and may put in some debug logging to try to catch this kind of hang and monitor the disk state.
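
(A minimal sketch of the kind of monitoring I have in mind, run alongside the clone on a worker; iostat comes from the sysstat package, and the log path and interval here are arbitrary:)

$ while true; do { date -u; iostat -x; dmesg | tail -n 20; } >> /tmp/diskwatch.log; sleep 10; done
# iostat -x records per-device I/O stats for the nvme/LVM disk; the dmesg tail should catch any kernel I/O errors around the time a clone hangs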

Reviewing the timed-out tasks from https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358, they mostly hit the same failures as the 3 examples in comment 0: an interrupted clone, or a "cloning" message and then nothing.

Examples (last log line before the "Task timeout" message):

~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep clone
1.android-4-0-armv7-api16.debug.7140686373634092707.1581453475.1581460703.live_backing.log-[vcs 2020-02-11T20:39:19.401Z] clone [========================>                    ] 1945217846/3426007566 39s
1.android-4-2-x86.opt.5135040868078535476.1581448740.1581455970.live_backing.log-[vcs 2020-02-11T19:20:38.873Z] clone [==================>                        ] 1571017784/3426007566 1m16s
1.android-5-0-aarch64.debug.2221161125476516910.1581448079.1581455318.live_backing.log-[vcs 2020-02-11T19:09:35.612Z] clone [====================>                        ] 1669120405/3426007566 55s
1.android-5-0-aarch64.opt.9218729948170072210.1581453025.1581460266.live_backing.log-[vcs 2020-02-11T20:31:59.359Z] clone [=================>                         ] 1504703846/3426007566 1m02s
1.android-5-0-x86_64.opt.7208775679883523062.1581452652.1581459886.live_backing.log-[vcs 2020-02-11T20:25:40.188Z] clone [=============>                             ] 1179549325/3426007566 1m32s
1.linux64-devedition.opt.4048035201520608580.1581521897.1581527322.live_backing.log-[vcs 2020-02-12T15:39:32.167Z] clone [====================>                        ] 1668359888/3427138758 49s
1.linux64.asan.7155597052681846487.1581465741.1581471165.live_backing.log-[vcs 2020-02-12T00:03:41.011Z] clone [=======================>                     ] 1863318163/3426007566 43s
1.linux64.asan.8212565173965073315.1581454747.1581460175.live_backing.log-[vcs 2020-02-11T21:00:44.736Z] clone [==================>                        ] 1581098989/3426007566 1m17s
1.linux64.debug.2636751713044725744.1581465972.1581468696.live_backing.log-[vcs 2020-02-12T00:07:34.630Z] clone [======================>                      ] 1777894088/3426007566 51s
1.linux64.debug.5510109461410872770.1581532006.1581534730.live_backing.log-[vcs 2020-02-12T18:28:08.704Z] clone [=========================>                   ] 2050883465/3427138758 37s
1.linux64.debug.8963396912676798448.1581465974.1581468698.live_backing.log-[vcs 2020-02-12T00:07:32.458Z] clone [=================>                         ] 1512134115/3426007566 1m03s
1.linux64.opt.4779619533921929029.1581447680.1581454911.live_backing.log-[vcs 2020-02-11T19:02:51.421Z] clone [======================>                      ] 1779466952/3426007566 53s
1.linux64.opt.8784018350883684421.1581448955.1581456179.live_backing.log-[vcs 2020-02-11T19:24:10.149Z] clone [==================>                        ] 1592507805/3426007566 1m17s
1.linux64.tsan.4992956426439574725.1581447809.1581453233.live_backing.log-[vcs 2020-02-11T19:05:00.361Z] clone [========================>                    ] 1907023313/3426007566 50s
1.linux64.tsan.6699715376909844223.1581465676.1581471106.live_backing.log-[vcs 2020-02-12T00:02:41.787Z] clone [====================>                        ] 1659055240/3426007566 55s
1.linux64.tsan.740457908796325518.1581465787.1581471217.live_backing.log-[vcs 2020-02-12T00:04:46.164Z] clone [=====================>                     ] 1787855560/3426007566 1m00s
1.linux64.tsan.956158274572320471.1581465746.1581471170.live_backing.log-[vcs 2020-02-12T00:03:55.873Z] clone [====================>                      ] 1714493512/3426007566 1m02s
1.osx-cross.debug.1341284457246523837.1581465568.1581470992.live_backing.log-[vcs 2020-02-11T23:59:59.394Z] clone [>                                          ]   87628124/3426007566 1m58s
1.osx-shippable.opt.6899541328811475486.1581448530.1581452155.live_backing.log-[vcs 2020-02-11T19:17:09.471Z] clone [===========================>                 ] 2178740183/3426007566 41s
1.toolchains.opt.8075127049421755574.1581460131.1581463760.live_backing.log-[vcs 2020-02-11T22:30:06.591Z] clone [================>                          ] 1388857754/3426007566 1m02s
1.windows2012-32-devedition.opt.2258701274242243645.1581507348.1581514571.live_backing.log-[vcs 2020-02-12T11:37:08.555Z] clone [=========================>                   ] 2038976193/3427138758 38s
1.windows2012-32-devedition.opt.935603359957235370.1581516930.1581522354.live_backing.log-[vcs 2020-02-12T14:16:42.283Z] clone [=======================>                     ] 1833288050/3427138758 39s
1.windows2012-32-shippable.opt.2203446541454065495.1581449745.1581456971.live_backing.log-[vcs 2020-02-11T19:37:12.615Z] clone [=======================>                     ] 1889987992/3426007566 47s
1.windows2012-aarch64-shippable.opt.6611326461786301728.1581521950.1581529174.live_backing.log-[vcs 2020-02-12T15:40:47.577Z] clone [=======================>                     ] 1844440000/3427138758 59s
2.linux64-aarch64.opt.6142367542606216485.1581450267.1581457507.live_backing.log-[vcs 2020-02-11T19:45:51.924Z] clone [===================>                         ] 1537001821/3426007566 48s
2.linux64.debug.4320382539503960048.1581465975.1581471428.live_backing.log-[vcs 2020-02-12T00:07:34.119Z] clone [==========>                                  ]  883863620/3426007566 58s
2.linux64.opt.4304945460674466998.1581460133.1581463761.live_backing.log-[vcs 2020-02-11T22:30:22.245Z] clone [========================>                    ] 1979023452/3426007566 42s
2.windows-mingw32.all.4522614149832682231.1581454456.1581461680.live_backing.log-[vcs 2020-02-11T20:55:32.809Z] clone [====================>                        ] 1607309398/3426007566 55s
~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep cloning
1.android-4-0-armv7-api16.debug.6899541328811475486.1581452158.1581459387.live_backing.log-[vcs 2020-02-11T20:16:43.503Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-0-armv7-api16.pgo.4992956426439574725.1581453239.1581460477.live_backing.log-[vcs 2020-02-11T20:34:50.492Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-0-armv7-api16.pgo.8730667323694669983.1581524945.1581532169.live_backing.log-[vcs 2020-02-12T16:29:32.061Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-0-geckoview-fat-aar.opt.7120138743283557661.1581534867.1581542089.live_backing.log-[vcs 2020-02-12T19:14:52.361Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-2-x86.debug.8140242803783263007.1581524946.1581532169.live_backing.log-[vcs 2020-02-12T16:29:32.720Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-4-2-x86.opt.2913941764092885856.1581524946.1581532171.live_backing.log-[vcs 2020-02-12T16:29:34.809Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.android-5-0-x86_64.debug.2042436425209502781.1581524945.1581532169.live_backing.log-[vcs 2020-02-12T16:29:33.136Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.1136869659206653110.1581467649.1581471266.live_backing.log-[vcs 2020-02-12T00:34:34.699Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.2707280311610152118.1581468012.1581471632.live_backing.log-[vcs 2020-02-12T00:40:45.572Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.3261117199127843984.1581467802.1581471422.live_backing.log-[vcs 2020-02-12T00:37:13.072Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32-shippable.opt.6474424353456821430.1581467802.1581471405.live_backing.log-[vcs 2020-02-12T00:36:46.062Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux32.debug.4304945460674466998.1581464407.1581469829.live_backing.log-[vcs 2020-02-11T23:40:42.224Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64-shippable.opt.1087673278437182443.1581487118.1581490738.live_backing.log-[vcs 2020-02-12T05:59:11.115Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64-shippable.opt.3528686596447447834.1581468580.1581472197.live_backing.log-[vcs 2020-02-12T00:49:59.803Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.asan.7999058355361287261.1581528590.1581534008.live_backing.log-[vcs 2020-02-12T17:30:10.428Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.5474995518196054417.1581528806.1581534223.live_backing.log-[vcs 2020-02-12T17:33:45.525Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.6474424353456821430.1581462018.1581467436.live_backing.log-[vcs 2020-02-11T23:00:49.183Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.6911032307653124192.1581528426.1581533829.live_backing.log-[vcs 2020-02-12T17:27:19.654Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.7080638530357990544.1581465591.1581468294.live_backing.log-[vcs 2020-02-12T00:00:03.008Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.linux64.debug.7822863695774066644.1581528638.1581534058.live_backing.log-[vcs 2020-02-12T17:31:10.087Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross-devedition.opt.2893510033435132656.1581504717.1581508336.live_backing.log-[vcs 2020-02-12T10:52:28.703Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross-devedition.opt.5135040868078535476.1581455973.1581463192.live_backing.log-[vcs 2020-02-11T21:20:03.480Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross-devedition.opt.6049069488453583567.1581504716.1581508333.live_backing.log-[vcs 2020-02-12T10:52:22.036Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-cross.debug.8637672537182855361.1581513689.1581519107.live_backing.log-[vcs 2020-02-12T13:21:58.447Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.2743940490296478580.1581507154.1581510772.live_backing.log-[vcs 2020-02-12T11:32:53.900Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.3725644940419437750.1581467841.1581471464.live_backing.log-[vcs 2020-02-12T00:37:52.959Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.4632915189236143286.1581467841.1581471462.live_backing.log-[vcs 2020-02-12T00:37:46.983Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.7155074290737386678.1581508365.1581511968.live_backing.log-[vcs 2020-02-12T11:52:48.995Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.osx-shippable.opt.8944846970400183478.1581467841.1581472884.live_backing.log-[vcs 2020-02-12T01:01:36.837Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-32-shippable.opt.8598930111106808623.1581507178.1581514396.live_backing.log-[vcs 2020-02-12T11:33:21.566Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-32-shippable.opt.8911141639019205814.1581507161.1581514363.live_backing.log-[vcs 2020-02-12T11:32:43.967Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-64-shippable.opt.7080638530357990544.1581468318.1581475520.live_backing.log-[vcs 2020-02-12T00:45:20.949Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-aarch64-devedition.opt.6464653165419460976.1581507671.1581514888.live_backing.log-[vcs 2020-02-12T11:41:39.748Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
1.windows2012-aarch64-shippable.opt.3160669768184135721.1581471325.1581478952.live_backing.log-[vcs 2020-02-12T01:42:37.435Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
2.linux64.opt.2413071487011296400.1581467704.1581471322.live_backing.log-[vcs 2020-02-12T00:35:33.303Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
2.linux64.opt.8555794478947795067.1581480514.1581484117.live_backing.log-[vcs 2020-02-12T04:08:37.521Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)

However, some of those are on the Windows platform.

Miscellaneous others in the group show task output or other actions before the timeout:

~/dev/releng$ grep -B2 "Task timeout" *live_backing.log | grep -Ev '(-$|Task timeout|clon)'
1.android-em-7-0-x86_64.debug.machine-39.1581435343.1581438050.live_backing.log-[task 2020-02-11T16:19:23.183Z] 16:19:23     INFO - Using adb 1.0.41
1.android-em-7-0-x86_64.opt.machine-22.1581314833.1581317537.live_backing.log-[task 2020-02-10T06:50:58.092Z] 06:50:58     INFO - Using adb 1.0.41
1.linux32-shippable.opt.3997768055339783382.1581467968.1581471570.live_backing.log-[fetches 2020-02-12T00:39:39.296Z] Removing /builds/worker/fetches/upx.tar.xz
1.linux32.debug.i-0c6ece209a02cea52.1581458517.1581463958.live_backing.log-[task 2020-02-11T23:32:33.894Z] 23:32:33     INFO -  package-tests> make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.linux32.opt.i-0886d3918088c8ecc.1581356381.1581359139.live_backing.log-[task 2020-02-10T18:20:12.245Z] 18:20:12     INFO -  make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/testing/geckodriver'
1.linux64.debug.i-0ada66834e3727eff.1581358813.1581364238.live_backing.log-[task 2020-02-10T19:49:06.469Z] 19:49:06     INFO -  package-tests> make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.linux64.debug.i-0dacdcbc862638ee4.1581425506.1581430922.live_backing.log-[task 2020-02-11T14:21:46.248Z] 14:21:46     INFO -  make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.linux64.debug.i-0f9befaaf5ea438ad.1581376730.1581382150.live_backing.log-[task 2020-02-11T00:48:57.749Z] 00:48:57     INFO -  buildsymbols> make[2]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
1.osx-cross.debug.i-07e77e82213a1381a.1581369251.1581374711.live_backing.log-[task 2020-02-10T22:24:01.341Z] 22:24:01     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/ipc/app'
2.linux1804-64-asan.opt.i-0caf4f36ae11f20f5.1581377613.1581384816.live_backing.log-[task 2020-02-11T01:33:33.805Z] 01:33:33     INFO - Closing window 39
2.linux1804-64-asan.opt.i-0d193853be308d9ae.1581528813.1581536241.live_backing.log-[task 2020-02-12T19:37:09.790Z] 19:37:09     INFO - PID 10837 | 1581536229785	addons.webextension.screenshots@mozilla.org	WARN	Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
2.linux1804-64-asan.opt.i-0fb1a4be8fdf74831.1581419024.1581426659.live_backing.log-[task 2020-02-11T13:10:57.649Z] 13:10:57     INFO - TEST-START | /_mozilla/dom/classList.html
2.linux1804-64-ccov.opt.i-06c2bcc4d2abf27f8.1581515486.1581520888.live_backing.log-[task 2020-02-12T15:20:03.491Z] 15:20:03     INFO -  TEST-START | dom/push/test/xpcshell/test_clear_forgetAboutSite.js
2.symbols.opt.1142744006236503768.1581465773.1581469386.live_backing.log-aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host msdl.microsoft.com:443 ssl:default [None]

(In reply to Dave House [:dhouse] from comment #42)

Reviewing the timed-out tasks from https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358, they mostly hit the same failures as the 3 examples in comment 0: an interrupted clone, or a "cloning" message and then nothing.

Nothing appears significant to me from these listings and counts across worker types (various workers, tasks, and times).

Since the logs show the same interruptions on windows workers, I wonder if it is a problem happening there also (and if we are running builds there, then maybe there are more tasks to review).

(In reply to Wander Lairson Costa from comment #37)

Your description is pretty accurate; I believe the change would also include deleting the current b-linux worker pool in ci-configuration, as it is configured for AWS. 302 to :tomprince to see if he sees something wrong.

:tomprince, could you guide me on how to send tasks to only the gecko-1/b-linux-gcp workerType? Can I locally commit a change to taskcluster/ci/build/linux.yml (s/b-linux/b-linux-gcp/g) and mach try with some build tasks, or is there a better (or intended) way to use linux-gcp/debug (https://hg.mozilla.org/mozilla-central/file/b7a8b4961e01b8a38be141bb89cc93f1ad404e4e/taskcluster/ci/build/linux.yml#l1350)?

Flags: needinfo?(mozilla)
Flags: needinfo?(mozilla)

You can select the -gcp builds from mach try fuzzy --full. Or, you can pass --worker-suffix b-linux=-gcp to most mach try selectors to run everything that would normally run on gecko-1/b-linux on gecko-1/b-linux-gcp instead; doing the latter won't select particular tasks, but any tasks that are selected will run on that worker.

There is also --worker-override <alias>=<worker-pool> if you need more control, but most use cases can use the suffix version.
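
For example, a sketch of the suffix form (which tasks you pick in the fuzzy interface is up to you):

$ ./mach try fuzzy --full --worker-suffix b-linux=-gcp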

Flags: needinfo?(mozilla)

I kicked off ~400 builds. Treeherder shows them at 87% complete with no failures, and I see 329 active GCE instances. https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c3cd02d3533602e6a08b200973e335065f27fa5

If there are failures in this group, then it will be simpler to track down. If there are none, then I think it may be a higher load or specific timing issue, and I can try a burst of tasks to reach the failure level and review timing (what else is happening).

I also want to expand my earlier search on timeouts to see if there are timeouts with the same log-cut-off behavior that I am expecting is a symptom of the problem. If it has been happening on regular tier3 builds, and on windows tasks, then it is not likely a problem specific to the linux workers.

(In reply to Dave House [:dhouse] from comment #46)

I kicked off ~400 builds. Treeherder shows them at 87% complete with no failures, and I see 329 active GCE instances. https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c3cd02d3533602e6a08b200973e335065f27fa5

One build failed but not with a timeout. Instead it failed with a connection reset message:

[vcs 2020-02-28T05:44:48.170Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '6c3cd02d3533602e6a08b200973e335065f27fa5', 'https://hg.mozilla.org/try', '/builds/worker/workspace/build/src']
[vcs 2020-02-28T05:44:48.255Z] (using Mercurial 4.8.1)
[vcs 2020-02-28T05:44:48.255Z] ensuring https://hg.mozilla.org/try@6c3cd02d3533602e6a08b200973e335065f27fa5 is available at /builds/worker/workspace/build/src
[vcs 2020-02-28T05:44:48.795Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-02-28T05:44:48.902Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-02-28T05:44:49.559Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/2c1dd534e937c92b4c5965892212d52d9b8a58f8.stream-v2.hg
[vcs 2020-02-28T05:44:49.628Z] 564092 files to transfer, 3.21 GB of data
[vcs 2020-02-28T05:44:51.630Z] 
[vcs 2020-02-28T05:44:52.629Z] clone [=>                                           ]  190404715/3449400926 35s
[vcs 2020-02-28T05:44:53.629Z] clone [=>                                           ]  228307637/3449400926 43s
[vcs 2020-02-28T05:44:54.629Z] clone [==>                                          ]  281860902/3449400926 45s
[vcs 2020-02-28T05:44:55.972Z] clone [===>                                         ]  339920859/3449400926 46s
[vcs 2020-02-28T05:44:55.972Z]                                                                                 
[vcs 2020-02-28T05:44:55.972Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "clone_errored", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.1731109619140625}, {"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "overall", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.713670969009399}, {"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "overall_clone", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.713670969009399}, {"extraOptions": ["projects/887720501152/machineTypes/custom-32-73728"], "hgVersion": "4.8.1", "lowerIsBetter": true, "name": "overall_clone_fullcheckout", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 7.713670969009399}]}
[vcs 2020-02-28T05:44:55.973Z] abort: Connection reset by peer
[taskcluster 2020-02-28 05:44:56.491Z] === Task Finished ===
[taskcluster 2020-02-28 05:44:56.702Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2020-02-28 05:44:56.892Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2020-02-28 05:44:57.107Z] Unsuccessful task run with exit code: 255 completed in 35.76 seconds

That looks very similar to the failures -- a clone log that does not show completion. The PERFHERDER_DATA entry and the following abort are different, though, and come immediately after a clone progress log entry, so it looks like a different problem (not a hang/timeout, but an interruption).

Rob, have you seen hg clone timeouts on windows builders in GCP? There are Windows machines in the failures from Feb 10-12 as well, when we had timeouts on linux builds: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358
But I did not find any in the current papertrail logs, so something is likely different between then and now (possibly the mass of linux workers not running). If you have seen this problem at other times, do you know about the cause or circumstances?

Flags: needinfo?(rthijssen)

Connor, do you know of mercurial debugging settings or logging that can show more details during a clone if the clone is stalling? I have not investigated what is available or read the mercurial source yet.

I've tried reproducing the failures (with a batch of 400 builds) and that did not give me a timeout/stall. So it looks like I will need to create a larger load; I'm manually setting up a load test of the clone on a small pool of instances to see if I can get a stall/failure that way without needing to spin up a high number of worker instances.
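
(The load test itself is nothing fancy -- roughly the following on each instance, a sketch with the robustcheckout arguments taken from the task logs in this bug; the iteration count and paths are arbitrary, and it assumes the robustcheckout extension is enabled in the worker's hgrc:)

for i in $(seq 1 20); do
  # wipe the share and destination so each iteration re-downloads the bundle
  rm -rf /builds/worker/checkouts/hg-store /builds/worker/workspace/build/src
  hg robustcheckout --sharebase /builds/worker/checkouts/hg-store --purge \
    --upstream https://hg.mozilla.org/mozilla-unified \
    --revision b4dc366c916164498a62980e867c0cf5647e6d41 \
    https://hg.mozilla.org/integration/autoland /builds/worker/workspace/build/src
done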

Flags: needinfo?(sheehan)

(In reply to Dave House [:dhouse] from comment #49)

Connor, do you know of mercurial debugging settings or logging that can show more details during a clone if the clone is stalling? I have not investigated what is available or read the mercurial source yet.

I've tried reproducing the failures (with a batch of 400 builds) and that did not give me a timeout/stall. So it looks like I will need to create a larger load; I'm manually setting up a load test of the clone on a small pool of instances to see if I can get a stall/failure that way without needing to spin up a high number of worker instances.

These are running robustcheckout. I am testing with just hg --debug robustcheckout [...] (I still need to reproduce the problem, but I want to catch some debug logs or other evidence when I do).

They are using an attached 375 GB disk:

Disk /dev/nvme0n1: 375 GiB, 402653184000 bytes, 98304000 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/mapper/instance_storage-all: 375 GiB, 402648989696 bytes, 98302976 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
$ sudo pvs
  PV           VG               Fmt  Attr PSize    PFree
  /dev/nvme0n1 instance_storage lvm2 a--  <375.00g    0 
$ sudo vgdisplay
  --- Volume group ---
  VG Name               instance_storage
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <375.00 GiB
  PE Size               4.00 MiB
  Total PE              95999
  Alloc PE / Size       95999 / <375.00 GiB
  Free  PE / Size       0 / 0   
  VG UUID               tQ2n7Y-MgpA-YzgN-Wqpd-u4HT-qURe-SGkPBB

These instances are running with 10 GB disks and 95 GB of memory.

WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#performance.

This applies only to the system boot disk. Poor I/O because of a small (shared) disk should not affect us, because the clone goes to memory and to the attached 375 GB disk.

(In reply to Dave House [:dhouse] from comment #49)

Connor, do you know of mercurial debugging settings or logging that can show more details during a clone if the clone is stalling? I have not investigated what is available or read the mercurial source yet.

I've tried reproducing the failures (with a batch of 400 builds) and that did not give me a timeout/stall. So it looks like I will need to create a larger load; I'm manually setting up a load test of the clone on a small pool of instances to see if I can get a stall/failure that way without needing to spin up a high number of worker instances.

You can use --traceback --verbose --debug on most commands, which should add debugging output and tell Mercurial to print the stack trace should it encounter an error. Those are global Mercurial flags so they should work with robustcheckout. You could also use --config ui.debug=true and see if that adds more output (though IIRC it is equivalent to --debug). There is also progress.debug=true for debugging the progress bar - I'm not sure how useful this will be but since we seem to be losing logs of the progress bar I thought it would be worth mentioning. YMMV.

There is also the extremely useful builtin blackbox extension which will create a log within the cloned repository (ie at repo/.hg/blackbox.log). If you run the clone command like hg --config extensions.blackbox= --verbose --traceback --debug robustcheckout ... all debugging output will make it to that log. You can see some of the other config options for blackbox by running hg --config extensions.blackbox= help -e blackbox.
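
Putting those together against the same repo the tasks use would look roughly like this (a sketch; the revision and paths just mirror the failing task logs, and blackbox will write to the destination repo's .hg/blackbox.log):

$ hg --config extensions.blackbox= --config progress.debug=true \
     --verbose --traceback --debug \
     robustcheckout --sharebase /builds/worker/checkouts/hg-store --purge \
     --upstream https://hg.mozilla.org/mozilla-unified \
     --revision b4dc366c916164498a62980e867c0cf5647e6d41 \
     https://hg.mozilla.org/integration/autoland /builds/worker/workspace/build/src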

Hope this helps. :)

Flags: needinfo?(sheehan)

(In reply to Cosmin Sabou [:CosminS] from comment #1)

This is blocking reopening autoland because the failure rate is still high and we can't get test coverage this way.

:CosminS, can you tell me what the rate of failures was? I looked around but I don't know where to find that in treeherder/other; I can look at machine types or check tasks in activedata, but I'm wondering if you know an easier way to find the counts.

Flags: needinfo?(csabou)

(In reply to Connor Sheehan [:sheehan] from comment #52)

Hope this helps. :)

Thank you Connor! This gives me more of the information I was hoping to find, and I'm hoping (when I can reproduce it) that the progress debug and traceback catch something. I'm testing to see if I get a traceback from mercurial (with these options) when I issue a hangup, and if I get the same from a task-run timeout.

(In reply to Dave House [:dhouse] from comment #53)

(In reply to Cosmin Sabou [:CosminS] from comment #1)

This is blocking reopening autoland because the failure rate is still high and we can't get test coverage this way.

:CosminS, can you tell me what the rate of failures was? I looked around but I don't know where to find that in treeherder/other; I can look at machine types or check tasks in activedata, but I'm wondering if you know an easier way to find the counts.

Dave, I would gladly give you a percentage and exact numbers, but unfortunately I cannot do that. I can show you how it looked on the trees, especially on autoland, where there was a busy night with lots of pushes and failures/bustages overlapping.
This is what was starred as an expected fail against this bug: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&tochange=39fd9ff27ad869a1d0168ac557b8c54d6a142515&fromchange=be916959142fc9ed315e855e2dce4b57fcac8e05&failure_classification_id=3
There were many more failures to come in the next hours, and this is only a range of the pushes affected by this.
Maybe Aryx could help with a telemetry query for what was starred against this bug as an expected fail, to get more accurate numbers. Thank you.

Flags: needinfo?(csabou) → needinfo?(aryx.bugmail)

(In reply to Cosmin Sabou [:CosminS] from comment #55)

(In reply to Dave House [:dhouse] from comment #53)

(In reply to Cosmin Sabou [:CosminS] from comment #1)

This is blocking reopening autoland cause the failures rate is still high and there can't be test coverage this way.

:CosminS, can you tell me what the rate of failures was? I looked around but I don't know where to find that in treeherder/other; I can look at machine types or check tasks in activedata, but I'm wondering if you know an easier way to find the counts.

Dave, I would gladly give you a percentage and exact numbers but unfortunately I cannot do that. I can show you how it looked on the trees, especially on autoland, where there was a busy night with lots of pushes and failures/bustages overlapping.
This is what was starred as expected fail against this bug: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&tochange=39fd9ff27ad869a1d0168ac557b8c54d6a142515&fromchange=be916959142fc9ed315e855e2dce4b57fcac8e05&failure_classification_id=3
Many more failures came in over the next hours, and this is only a range of pushes affected by this.
Maybe Aryx could help with a telemetry query for what was starred against this bug as expected fail, to get more accurate numbers. Thank you.

Thank you, this is helpful, even just to know I'm not overlooking an obvious way to get the numbers. I started working off of some of Aryx's queries in redash to track this down, and I think I have the data I need (timeout counts (thanks to treeherder's text_log_error) and total counts).

I want to give numbers for this to google support and it may help me in figuring out how to reproduce the failure.

(In reply to Dave House [:dhouse] from comment #48)

Rob, have you seen hg clone timeouts on windows builders in GCP? There are Windows machines also in the failures Feb 10-12 when we had timeouts on linux builds: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358
But I did not find any in current papertrail logs. So something is likely different between then and now (possibly the mass of linux workers not running). If you have seen this problem at other times, do you know about the cause or circumstances?

the windows build log i checked makes it look like the instance was lost in the middle of the clone operation: the instance just stops creating log output mid-clone, and later the queue resolves the task as a timeout.

Flags: needinfo?(rthijssen)

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #57)

(In reply to Dave House [:dhouse] from comment #48)

Rob, have you seen hg clone timeouts on windows builders in GCP? There are Windows machines also in the failures Feb 10-12 when we had timeouts on linux builds: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-02-09&endday=2020-02-12&tree=all&bug=1411358
But I did not find any in current papertrail logs. So something is likely different between then and now (possibly the mass of linux workers not running). If you have seen this problem at other times, do you know about the cause or circumstances?

the windows build log i checked makes it look like the instance was lost in the middle of the clone operation: the instance just stops creating log output mid-clone, and later the queue resolves the task as a timeout.

Thanks! That is similar to linux: the task log just stops without an error or anything (perhaps it is only the log upload that is terminated/killed). I had thought the log is resolved on the instance and that the task process just stops creating log output, but I need to verify that is what happens and that it isn't the queue resolving them as a timeout.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #58)

List of tasks classified as this bug: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-07&endday=2020-03-06&tree=all&bug=1614852

Thanks :Aryx! I'll review these (I was looking at bug 1411358 before). I am also collecting all tasks with timeouts from Feb 2-24+ and checking whether we had any repeats outside the problem window of Feb 10-12: https://sql.telemetry.mozilla.org/queries/68967/source#table with the few gcp workers we have there.

I've tested the additional debugging options. I also made a minimal-depth walk through the Mercurial source to confirm the process shown in the logs (to review when/if the problem is reproduced). Example start of a clone with the gcs bundle:

(using Mercurial 4.8.1)
ensuring https://hg.mozilla.org/try@6c3cd02d3533602e6a08b200973e335065f27fa5 is available at /builds/worker/workspace/build/src
using https://hg.mozilla.org/mozilla-unified
sending capabilities command
hg.mozilla.org certificate matched fingerprint sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48
sending lookup command
sending 5 bytes
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
sending lookup command
sending 5 bytes
(sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
using https://hg.mozilla.org/mozilla-unified
sending capabilities command
hg.mozilla.org certificate matched fingerprint sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48
sending clonebundles command
applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/14caf2826f00fc609f7fe55d8df14e7e08fd1ced.stream-v2.hg
bundle2-input-bundle: with-transaction
bundle2-input-part: "stream2" (params: 3 mandatory) supported
applying stream bundle
564985 files to transfer, 3.22 GB of data
[...]

Parsed as stream2 (https://www.mercurial-scm.org/repo/hg-stable/file/4.8.1/mercurial/bundle2.py#l499), matches bundle header:

$ curl -r 0-200 -o - "https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/14caf2826f00fc609f7fe55d8df14e7e08fd1ced.stream-v2.hg" | hexdump -C
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   201  100   201    0     0    862      0 --:--:-- --:--:-- --:--:--   858
00000000  48 47 32 30 00 00 00 00  00 00 00 77 07 53 54 52  |HG20.......w.STR|
00000010  45 41 4d 32 00 00 00 00  03 00 09 0a 09 06 0c 35  |EAM2...........5|
00000020  62 79 74 65 63 6f 75 6e  74 33 34 35 39 34 30 32  |bytecount3459402|
00000030  33 31 33 66 69 6c 65 63  6f 75 6e 74 35 36 34 39  |313filecount5649|
00000040  38 35 72 65 71 75 69 72  65 6d 65 6e 74 73 64 6f  |85requirementsdo|
00000050  74 65 6e 63 6f 64 65 25  32 43 66 6e 63 61 63 68  |tencode%2Cfncach|
00000060  65 25 32 43 67 65 6e 65  72 61 6c 64 65 6c 74 61  |e%2Cgeneraldelta|
00000070  25 32 43 72 65 76 6c 6f  67 76 31 25 32 43 73 74  |%2Crevlogv1%2Cst|
00000080  6f 72 65 00 00 80 00 73  11 c3 03 64 61 74 61 2f  |ore....s...data/|
00000090  2e 61 72 63 63 6f 6e 66  69 67 2e 69 00 03 00 01  |.arcconfig.i....|
000000a0  00 00 00 00 00 00 00 44  00 00 00 46 00 00 00 00  |.......D...F....|
000000b0  00 06 b9 74 ff ff ff ff  ff ff ff ff 78 81 e6 c2  |...t........x...|
000000c0  0d 7c 38 d6 37 a3 ab 7c  8b                       |.|8.7..|.|
000000c9

To consider later: hg 4.8.1 is from Nov 2018. If this is caused by a bug in Mercurial, it may be fixed in the current stable release. We might also look at the version of the extensions (such as robustcheckout) we're using.

A timed-out task is forcibly killed by docker-worker (https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L680 example: https://firefox-ci-tc.services.mozilla.com/tasks/NymXXRv4RrWSDbHGwV3CtQ/runs/0/logs/live/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FNymXXRv4RrWSDbHGwV3CtQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log), so there is no chance for mercurial to log its trace or handle the kill. To get a trace and logs, I need to trigger the timeout and hangup within the task. The task is not given its timeout as a variable, but I can pick a reasonable time for the download since the task timeouts shown in the failure logs are all 1800s+ (mostly 3600, 5400, 7200) and the bundle download time ranges from 100-300s, mostly around 180s, in my testing.
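To exercise that inside a task without waiting for the real limit, a sketch along these lines could work (the 300s budget is arbitrary, and the elided robustcheckout arguments are whatever the task normally passes); timeout sends HUP well before docker-worker's SIGKILL, so hg still gets a chance to write its traceback and blackbox log:

# give the clone a 300s budget (well under the 1800s+ task timeouts), then HUP it;
# escalate to KILL 60s later if the process ignores the hangup
$ timeout --signal=HUP --kill-after=60 300 \
    hg --traceback --verbose --debug robustcheckout ... \
    || echo "clone exited with status $?"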

more, re: the force kill:

(In reply to Dave House [:dhouse] from comment #62)

A timed-out task is forcibly killed by docker-worker (https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L680

in this call, we aren't checking the kill result or whether the container still existed. The function returns a promise that resolves with the result.
uses dockerode, https://github.com/apocas/dockerode/blob/f5606ed233625bdd1a9a9be4458e439f9040aac1/lib/container.js#L637,
uses the api call /containers/container_id/kill
which issues SIGKILL by default: https://docs.docker.com/engine/api/v1.40/#operation/ContainerKill

also explained in the comment: https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L677

the run function which is waiting on the container run to finish (issues start and then wait): https://github.com/taskcluster/docker-worker/blob/b089a1dbf5aed5b9cd1f70fcf25f1e92ff05a15a/src/lib/task.js#L929 - https://github.com/apocas/dockerode/blob/f5606ed233625bdd1a9a9be4458e439f9040aac1/lib/docker.js#L1464
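For comparison, the same API is reachable from the docker CLI, so a gentler shutdown can be tried by hand on a worker (the container id is a placeholder):

$ docker kill --signal=SIGHUP <container_id>   # kill defaults to SIGKILL; an explicit signal lets hg print its traceback
$ docker stop --time=30 <container_id>         # stop sends SIGTERM, then SIGKILL after the grace period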

I opened a case with google support, 22377255. I've had 3 back-and-forth emails so far, and the details that may be relevant to the problem/cause so far are:

While inspecting internal resources from February the 10th through the 12th for project “fxci-production-level1-workers, I see you had 247 Compute Engine VMs on February the 12th around 18:00 PST time consuming 953 mbps internal network traffic at the highest peak. As far as disk issues, nics, etc… I was not able to locate any issues other than the 953 mbps of internal network traffic between your two GCP projects.

After inspecting project “fxci-production-level3-workers” during this time , you had 704 Compute Engine VMs at your highest peak on February 12 around 01:00 PST in the us-central1 region 422 mbps and no disk problems, nics, etc… I also suspect high internal network traffic.

Following [1], there is some competition between network traffic and persistent disk traffic, so this may be the issue at hand. There potentially could have been some instability due to the high amount of usage. My next step will be to investigate this hunch to verify.You can expect an update no later than 6pm PST tomorrow.

[1] https://cloud.google.com/compute/docs/disks/performance#egress_performance_cap

GCP VM disk i/o is on the same network as the vm's nic, and that disk i/o is capped based on vm size.

Reading this, I am curious whether it is possible to lock up a system with combined high network and disk i/o. They *should* just be throttled.
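One way to check that on a test instance would be to watch disk and NIC utilization side by side while a clone runs, roughly like this (assumes the sysstat tools are installed in the image; paths and the elided robustcheckout arguments are placeholders):

$ iostat -xz 5 > /tmp/disk.log &    # per-device utilization, await and throughput
$ sar -n DEV 5 > /tmp/net.log &     # per-interface rx/tx throughput
$ hg robustcheckout ...             # run the clone under observation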

Re: the docker kill on timeout, we don't know the container state (only that we've reached a timeout without the task ending, and the logs were truncated). I'm still thinking there may be something more with that, and not with network/disk as a cause; it may be that some of these instances were hung or the logs were disconnected. Perhaps when a disk write is throttled, the clone stalls and does not resume when the disk is ready (on the gcp vm, or a bug with the lvm/config/something).

(In reply to Dave House [:dhouse] from comment #64)

GCP VM disk i/o is on the same network as the vm's nic, and that disk i/o is capped based on vm size.

Reading this, I am curious whether it is possible to lock up a system with combined high network and disk i/o. They *should* just be throttled.

I'm fairly certain this is not our problem. The support agent asked about the attached disk, and on checking, it is local SSD, so I've asked if that is throttled with network (as total data in/out of the vm). I'm assuming it is not, and that the competition between network and disk is for persistent disks only and not local ssd (if I remember my gce work from years ago, that was the case then), unless there is throttling for all vm data in/out.

I downloaded logs for all task timeouts from jan 1st through mar 11. The timeouts during cloning appear on gcp only from feb 5th through 13th. I pushed those numbers, grouped by tier 1/2/3, to influx to compare with gcp monitoring data: https://earthangel-b40313e5.influxcloud.net/d/9vTbVNXWk/gcp-hg-timeouts?orgId=1&from=1581453741293&to=1581594479200
That has shown me that we had more timeouts when we had more tasks running, but also that we had timeouts throughout that period, only increasing/decreasing based on the number of instances. So it does not look like it was caused by the scaling or the number of instances.

I've noticed the same timeouts in aws (the logging matches with a hg clone stalled/interrupted and then much later the task is timed-out terminated). So I'm reviewing that to check when those happen, verifying they are real (?!), and how frequently.

(In reply to Dave House [:dhouse] from comment #65)

I've noticed the same timeouts in aws (the logging matches with a hg clone stalled/interrupted and then much later the task is timed-out terminated). So I'm reviewing that to check when those happen, verifying they are real (?!), and how frequently.

The hg clone timeouts on tasks in aws span from 2019-11-13 (this is when the treeherder db started recording "text_log_error.line", so I don't have history before that date and can assume the occurrences continued earlier) until today.

4 happened yesterday on aws. They don't hit with the same volume when there is scaling as we saw on gcp. That is weird: if it is regularly intermittent, we should see it increase when we're scaled up in aws the same as we saw in gcp (maybe we do, at a lower scale -- checking scaling numbers). BUT, I should also have seen it within my test of 400 tasks on gcp if it was simply a 1-in-N chance of hitting it. So it seems like it is a problem in something that is not a function of the instance scaling -- and that we saw it more when scaling because more instances were hitting it within the problem occurrence windows: A-B-C with B scaled up, showing the problem in C. The difference on AWS could be that the problem thing recovers more quickly there (maybe the GCS/S3 bucket or something like that).

I should be able to reproduce it in aws because it is ongoing there; that, or I could catch an instance that hits the problem in aws before it is terminated. However, reproducing in aws may show a different cause than gcp; I can look back at this if I cannot reproduce it in gcp as it may show some relative cause (like s3 vs gcs).

I'm going to retry hg clones in a loop for the next few days in gcp to see if I get a hit of the timeout. This will be a few instances running 24xN, so the cost is minimal compared with testing at the scale of 700+ instances.
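The soak test is roughly a loop like the following on each instance (paths and the 900s ceiling are arbitrary; a plain hg clone should exercise the same clone-bundle download path, assuming the server steers the instance to the GCS bundle):

while true; do
  rm -rf /tmp/clone-soak
  timeout 900 hg clone https://hg.mozilla.org/mozilla-unified /tmp/clone-soak \
    || echo "$(date -u +%FT%TZ) clone failed or exceeded 900s" >> /tmp/clone-soak.log
done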

(In reply to Dave House [:dhouse] from comment #66)

(In reply to Dave House [:dhouse] from comment #65)

I've noticed the same timeouts in aws (the logging matches with a hg clone stalled/interrupted and then much later the task is timed-out terminated). So I'm reviewing that to check when those happen, verifying they are real (?!), and how frequently.

The hg clone timeouts on tasks in aws span from 2019-11-13 (this is when the treeherder db started recording "text_log_error.line", so I don't have history before that date and can assume the occurrences continued earlier) until today.

Connor, what do you think could cause the mercurial client to do this? (stopping with no logging; we don't get a trace because the container is killed/dropped). I think it may be something like network drops that the mercurial client is not catching and reaches a bad state or keeps waiting.

I found we see the same problem on aws (back as far as I could track in treeherder's db, to Nov 13 '19), but not at the same volume as we hit in gcp. Here are some recent times for the timeout on aws workers (next log line for each was a task-kill timeout message):

logs/1.windows2012-32-shippable.opt.i-003e9237b981e7620.1579914066.1579921281
[vcs 2020-01-25T01:01:20.019Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
logs/1.android-5-0-x86_64.asan.i-003e9237b981e7620.1579921779.1579929001
[vcs 2020-01-25T03:09:59.160Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
logs/1.linux32-shippable.opt.i-0f4fa7f012c0d90ed.1580775450.1580779748
[vcs 2020-02-04T01:28:36.327Z] updating [============================================>   ] 267700/285445 2m29s
logs/1.linux32-shippable.opt.i-0df4ed6a331e2ebfc.1580778092.1580779769
[vcs 2020-02-04T01:29:19.699Z] clone [===========================================> ] 3398433941/3419233525 07s
logs/1.linux64-shippable.opt.i-0df4ed6a331e2ebfc.1580780301.1580786330
[vcs 2020-02-04T01:52:42.438Z] (cloning from upstream repo https://us-east-1.hgmointernal.net/mozilla-unified)
logs/1.osx-shippable.opt.i-02139cfcb8d9fc133.1581079622.1581080483
[vcs 2020-02-07T12:57:23.451Z] clone [==========================================>  ] 3323112934/3424341814 07s
logs/1.osx-shippable.opt.i-0c85a77090dcecc5b.1581087228.1581088610
[vcs 2020-02-07T15:16:43.655Z] clone [===========================================> ] 3419210782/3424341814 03s
Flags: needinfo?(sheehan)

shoot, I was looking at the last month. here is a March 6th example on aws:

logs/2.symbols.opt.i-06d078ecaec590c56.1583453056.1583454439
[vcs 2020-03-06T00:27:08.674Z] clone [==========================================>  ] 3354366474/3457790664 35s

(In reply to Dave House [:dhouse] from comment #67)

Connor, what do you think could cause the mercurial client to do this? (stopping with no logging; we don't get a trace because the container is killed/dropped). I think it may be something like network drops that the mercurial client is not catching and reaches a bad state or keeps waiting.

I've got no idea, to be honest. Mercurial does have issues with flaky networks, but usually this results in an error message/traceback indicating what went wrong. We've had someone on our team (Mitch) running hg clones with a purposely bad environment attempting to replicate some of the issues we've been seeing (see https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c101 for his first attempts to try and reproduce a network failure).

If we're always losing the log output during the clone while downloading a clone bundle, the first place to look for issues in upstream Mercurial would be this function. You can see this function is called directly between the status messages "applying clone bundle from XXX" and "finished applying clone bundle" here. It may be worth trying to run some clones with --profile and seeing if anything useful comes of that.
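For the --profile suggestion, something along these lines writes the profile to a file instead of mixing it into the task log (the output path is a placeholder; the elided robustcheckout arguments are whatever the task normally passes):

$ hg --profile --config profiling.output=/tmp/clone.prof --config profiling.type=stat \
    robustcheckout ...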

It's interesting that we've hit similar errors in AWS as well. Can you provide some URLs to tasks in Treeherder or Taskcluster so I can take a closer look? It's odd that we don't see this issue as much in AWS (this is my first time hearing of it). Maybe there's some interaction going on here, like hg has issues with shoddy connections, and GCS is less reliable for downloads than S3, so we see the issue show up much more frequently in GCP.

Flags: needinfo?(sheehan) → needinfo?(dhouse)

other thread/topic, continuing to try to reproduce in gcp:

Reviewing the boot disk images, I see there is one updated on Feb12th.
The last timeout on GCP was Feb 12th at 10:12am UTC (2:12am pacific).
The image was created Feb 13th 03:32 UTC ("2020-02-12T19:32:59.061-08:00")

This could be why I could not reproduce the problem since I was using this new image (testing on current -gcp workers through try). (++humility for not noticing this sooner)

I'm testing next with the old image and checking metrics+logs for what images were used when.

(In reply to Connor Sheehan [:sheehan] from comment #69)

I've got no idea, to be honest. Mercurial does have issues with flaky networks, but usually this results in an error message/traceback indicating what went wrong. We've had someone on our team (Mitch) running hg clones with a purposely bad environment attempting to replicate some of the issues we've been seeing (see https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c101 for his first attempts to try and reproduce a network failure).

If we're always losing the log output during the clone while downloading a clone bundle, the first place to look for issues in upstream Mercurial would be this function. You can see this function is called directly between the status messages "applying clone bundle from XXX" and "finished applying clone bundle" here. It may be worth trying to run some clones with --profile and seeing if anything useful comes of that.

Thanks! I'll check these out. I think we're losing any logs or trace because we're force killing off the container and processes instead of sending a hangup/etc (My testing showed I lost any trace and trailing logs when I killed off a container running hg).

It's interesting that we've hit similar errors in AWS as well. Can you provide some URLs to tasks in Treeherder or Taskcluster so I can take a closer look? It's odd that we don't see this issue as much in AWS (this is my first time hearing of it). Maybe there's some interaction going on here, like hg has issues with shoddy connections, and GCS is less reliable for downloads than S3, so we see the issue show up much more frequently in GCP.

flakiness with GCS vs. S3 is my leading hypothesis for now :)

Here are links to task logs for some aws failures:

https://firefox-ci-tc.services.mozilla.com/tasks/WXkMwluDQomiGL3kfVUWPw/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FWXkMwluDQomiGL3kfVUWPw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

[taskcluster 2020-03-06 00:04:16.435Z] Task ID: WXkMwluDQomiGL3kfVUWPw
[taskcluster 2020-03-06 00:04:16.435Z] Worker ID: i-06d078ecaec590c56
[taskcluster 2020-03-06 00:04:16.435Z] Worker Group: aws
[taskcluster 2020-03-06 00:04:16.435Z] Worker Node Type: c5.4xlarge
[taskcluster 2020-03-06 00:04:16.435Z] Worker Type: b-linux
[taskcluster 2020-03-06 00:04:16.435Z] Public IP: 18.205.41.175
[taskcluster 2020-03-06 00:04:16.435Z] Hostname: ip-10-145-96-104
[taskcluster 2020-03-06 00:04:16.435Z] using cache "gecko-level-3-checkouts-sparse-v3-f730704b60167efec983" -> /builds/worker/checkouts

https://firefox-ci-tc.services.mozilla.com/tasks/BmWnYCLCRBGe_yT99Vtpjg/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FBmWnYCLCRBGe_yT99Vtpjg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

https://firefox-ci-tc.services.mozilla.com/tasks/eRMLw3o5SECh77R5D7fYeQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FBmWnYCLCRBGe_yT99Vtpjg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

https://firefox-ci-tc.services.mozilla.com/tasks/NOkGWQwyQFaeR5Zx4n92gw/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FNOkGWQwyQFaeR5Zx4n92gw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

https://firefox-ci-tc.services.mozilla.com/tasks/ercwQs7LQ2WIaiCcbvukcQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FNOkGWQwyQFaeR5Zx4n92gw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

These look like the download was just slow (there is not a long time gap between the last log line and the timeout kill), without the huge time gap we saw on gcp.
So it doesn't look as similar as I thought. The gcp logs also time out during the clone, but with a long time gap, like:

[vcs 2020-02-13T03:12:52.225Z] ensuring https://hg.mozilla.org/integration/autoland@2f205d029f748eb1ac47ea1057511d1c1f4f7754 is available at /builds/worker/checkouts/gecko
[vcs 2020-02-13T03:12:52.879Z] (shared store has an active lock; assuming it is left over from a previous process and that the store is corrupt; deleting store and destination just to be sure)
[vcs 2020-02-13T03:13:01.474Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)

[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2020-02-13 04:08:01.001Z] === Task Finished ===

or

[vcs 2020-02-12T21:04:30.521Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/2a2b0b1b3be697dfbabe4dac69304c91878f28ef.stream-v2.hg
[vcs 2020-02-12T21:04:30.583Z] 562193 files to transfer, 3.19 GB of data
[vcs 2020-02-12T21:04:32.586Z] 
[vcs 2020-02-12T21:04:33.585Z] clone [>                                            ]  144006766/3427138758 46s
[...]
[vcs 2020-02-12T21:05:23.629Z] clone [=======================>                     ] 1870442547/3427138758 44s

[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-02-12 22:34:25.861Z] === Task Finished ===
Flags: needinfo?(dhouse)

(In reply to Dave House [:dhouse] from comment #70)

other thread/topic, continuing to try to reproduce in gcp:

Reviewing the boot disk images, I see there is one updated on Feb12th.
The last timeout on GCP was Feb 12th at 10:12am UTC (2:12am pacific).
The image was created Feb 13th 03:32 UTC ("2020-02-12T19:32:59.061-08:00")

The image being used by a worker is configured in https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-images.yml
(for apply by ci-admin to worker-manager).

The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54

So the image at the failure time was:
monopacker-docker-worker-trusted-2020-02-04t16-30-23z:
fxci-level3-gcp:
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z

from:

changeset:   618:c25b1fc0a6c8
user:        Wander Lairson Costa <wander.lairson@gmail.com>
date:        Wed Feb 05 23:52:11 2020 +0000
summary:     Bug 1611255: Fix correct path for level 3 GCP image r=tomprince

diff --git a/worker-images.yml b/worker-images.yml
--- a/worker-images.yml
+++ b/worker-images.yml
@@ -120,7 +120,7 @@ monopacker-docker-worker-2020-02-03t19-2
     projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
 monopacker-docker-worker-trusted-2020-02-04t16-30-23z:
   fxci-level3-gcp:
-    projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
+    projects/fxci-production-level3-workers/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
 

following

changeset:   616:e0c509239c21
user:        Wander Lairson Costa <wander.lairson@gmail.com>
date:        Tue Feb 04 18:56:47 2020 +0000
summary:     Bug 1611255: Deploy trusted GCP builders r=Callek

diff --git a/worker-images.yml b/worker-images.yml
--- a/worker-images.yml
+++ b/worker-images.yml
@@ -10,7 +10,7 @@
 docker-worker-hvm-builder-current: docker-worker-v201912041449-hvm-builder
 docker-worker-hvm-builder-trusted-current: docker-worker-v201912041449-hvm-builder-trusted
 monopacker-docker-worker-current: monopacker-docker-worker-2020-02-03t19-23-00z
-monopacker-docker-worker-trusted-current: monopacker-docker-worker-2020-02-03t19-23-00z
+monopacker-docker-worker-trusted-current: monopacker-docker-worker-trusted-2020-02-04t16-30-23z
 
 occ-b-win2012-current: occ-b-win2012-pre
 occ-b-win2012-trusted-current: occ-b-win2012-trusted-pre
@@ -118,6 +118,9 @@ monopacker-docker-worker-2020-02-03t19-2
     projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
   fxci-levelt-gcp:
     projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
+monopacker-docker-worker-trusted-2020-02-04t16-30-23z:
+  fxci-level3-gcp:
+    projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
 

following

changeset:   611:532c0442bc8b
user:        Wander Lairson Costa <wander.lairson@gmail.com>
date:        Mon Feb 03 20:06:43 2020 +0000
summary:     Bug 1611255: Upgrade GCP workers r=Callek

diff --git a/worker-images.yml b/worker-images.yml
--- a/worker-images.yml
+++ b/worker-images.yml
@@ -9,8 +9,8 @@
 ---
 docker-worker-hvm-builder-current: docker-worker-v201912041449-hvm-builder
 docker-worker-hvm-builder-trusted-current: docker-worker-v201912041449-hvm-builder-trusted
-monopacker-docker-worker-current: monopacker-docker-2019-11-18t20-32-23z
-monopacker-docker-worker-trusted-current: monopacker-docker-worker-trusted-2020-01-10t21-13-32z
+monopacker-docker-worker-current: monopacker-docker-worker-2020-02-03t19-23-00z
+monopacker-docker-worker-trusted-current: monopacker-docker-worker-2020-02-03t19-23-00z
 
 occ-b-win2012-current: occ-b-win2012-pre
 occ-b-win2012-trusted-current: occ-b-win2012-trusted-pre
@@ -111,6 +111,13 @@ monopacker-docker-worker-trusted-2020-01
   fxci-level3-gcp:
     # uses new cot key, i.e. tk/Sj...
     projects/fxci-production-level3-workers/global/images/docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
+monopacker-docker-worker-2020-02-03t19-23-00z:
+  fxci-level1-gcp:
+    projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
+  fxci-level3-gcp:
+    projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
+  fxci-levelt-gcp:
+    projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
 

The bug was for updating the gcp image to decrease the default reclaim time: bug 1611255 "[docker-worker] High volume of tasks failing with claim-expired in GCP"

(In reply to Dave House [:dhouse] from comment #72)

The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54

The hg-clone timeouts on gcp happened in the window from 02-06 01:50:00 to 02-13 10:12:00.

(In reply to Dave House [:dhouse] from comment #73)

(In reply to Dave House [:dhouse] from comment #72)

The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54

The hg-clone timeouts on gcp happened in the window from 02-06 01:50:00 to 02-13 10:12:00.

That was level3 images. The images for level1:

projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z 2020-02-03 21:53:19
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-07t09-14-17z 2020-02-07 19:32:36 "Bug 1613800: Deploy new level 1 GCP workers; r=tomprince"
"" Wed, 12 Feb 2020 09:46:23 "Bug 1609568: Increase the memory of GCE instances r=jlorenzogg"
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-07t09-14-17z 2020-02-13 04:16:54 "Bug 1613800 - remove level 1 monopacker images, add and use replacement trusted gcp image r=aki"

The level1 instances are still using this disk image, docker-worker-gcp-googlecompute-2020-02-07t09-14-17z

The hg-clone timeouts on level1 happened in the window 02-12 02:02:00 to 02-13 09:58:00.

I looked for an audit log in taskcluster docs, and maybe the logs are where I need to check when ci-admin applied the changes to use the images.

I checked my level1 failures. I had tagged them incorrectly: the ones I had thought (from treeherder) were level 1 were actually tier 1, but ran on the level3 project instances.

I'm checking over all of my data and assumptions+conclusions about that data, but from what I found last night the failures were only on level3:

2374 failures
only occurred in time window: 02-06 01:50:00 to 02-13 10:12:00

level3 images were (estimate based on ci-conf push timestamp, datetime listed before change):
worker_pool:gecko-3.b-linux-gcp
image: monopacker-docker-worker-trusted-current
2020-01-14 21:24:47
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
2020-02-03 21:53:19
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-03t19-23-00z
2020-02-04 20:00:43
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
2020-02-05 23:52:21
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
2020-02-06 09:34:33
projects/taskcluster-imaging/global/images/docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
2020-02-13 04:16:54
projects/fxci-production-level3-workers/global/images/docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z

Looking at project audit log the images were created:

level3 last two images (the config ref'd image on 2020-02-05 23:52:21 did not exist):

2/12/20
8:33 PM
Completed: Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z
8:32 PM
Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z
1/10/20
2:20 PM
Completed: Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z
2:20 PM
Create image
mcrabill@mozilla.com created docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z

the image during the failure time was from the taskcluster-imaging project (I'm reviewing that audit log now).

tc-imaging audit last actions for the problem image:

2/12/20 
3:21 PM
Completed: Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-googlecompute-2019-11-04t19-24-29z
3:21 PM
Completed: Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
--
3:21 PM
Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-l3-googlecompute-2019-10-08t17-57-06z
3:21 PM
Delete image
mcrabill@mozilla.com deleted docker-worker-gcp-googlecompute-2020-02-04t16-30-23z
--
Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-06t11-05-28z
2/4/20
9:41 AM
Completed: Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-04t16-30-23z

(In reply to Dave House [:dhouse] from comment #77)

2/4/20
9:41 AM
Completed: Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-04t16-30-23z

Miles, can I just check out the commit from that time period (fc00dd508b) in monopacker and build the docker-worker image to get it into gcp?

Flags: needinfo?(miles)

(In reply to Dave House [:dhouse] from comment #73)

(In reply to Dave House [:dhouse] from comment #72)

The image history based on the commits (hgmo push time, UTC) is:
docker-worker-gcp-trusted-googlecompute-2020-01-10t21-13-32z 2020-01-14 21:24:47
docker-worker-gcp-googlecompute-2020-02-04t16-30-23z 2020-02-05 23:52:21
docker-worker-gcp-trusted-googlecompute-2020-02-13t03-22-56z 2020-02-13 04:16:54

The hg-clone timeouts on gcp happened in the window from 02-06 01:50:00 to 02-13 10:12:00.

Correction: I thought grafana was showing me times in my timezone, but checking on the actual first and last failures I found the UTC datetimes match the times listed in my queries:
first: 2020-02-05T18:50:58.263Z
last: 2020-02-13T03:13:01.474Z

The non-documentation changes in monopacker between the failing image's time (commit fc00dd508b) and the image I've tested (built feb 12, so likely commit debfec96) are:

diff --git a/builders/docker_worker_gcp_l3.yaml b/builders/docker_worker_gcp_community.yaml
similarity index 88%
rename from builders/docker_worker_gcp_l3.yaml
rename to builders/docker_worker_gcp_community.yaml
index b53ff99..2fc336f 100644
--- a/builders/docker_worker_gcp_l3.yaml
+++ b/builders/docker_worker_gcp_community.yaml
@@ -1,4 +1,3 @@
-# copy of docker_worker_gcp
 template: googlecompute
 platform: linux
 
diff --git a/builders/docker_worker_gcp_trusted.yaml b/builders/docker_worker_gcp_trusted.yaml
new file mode 100644
index 0000000..aa98ef0
--- /dev/null
+++ b/builders/docker_worker_gcp_trusted.yaml
@@ -0,0 +1,18 @@
+# copy of docker_worker_gcp in l3 project
+template: googlecompute
+platform: linux
+
+builder_var_files:
+  - default_linux
+  - default_gcp
+  - googlecompute_bionic
+
+script_directories:
+  - ubuntu-bionic
+  - worker-runner-linux
+  - docker-worker-linux
+
+# Note: this project disallows port 22, so baking images requires
+# temporarily allowing access
+builder_vars:
+  project_id: fxci-production-level3-workers
diff --git a/template/builders/googlecompute.jinja2 b/template/builders/googlecompute.jinja2
index 04b49e5..256953d 100644
--- a/template/builders/googlecompute.jinja2
+++ b/template/builders/googlecompute.jinja2
@@ -5,3 +5,7 @@
   ssh_username: {{builder.vars.ssh_username}}
   project_id: {{builder.vars.project_id}}
   zone: {{builder.vars.zone}}
+  {# enable nested virtualization. It allows to run the x86 android emulator
+      inside the VM with processor virtualization capabilities enabled #}
+  image_licenses:
+    - projects/vm-options/global/licenses/enable-vmx

The project change and the nested-virtualization enablement do not seem like they would change anything related to the hg clone.

It may be that the docker-worker version used was different.
The docker-worker versions around that time were:
docker_worker_version="v202002031613"
v202002031039 (not committed as a change in monopacker)
docker_worker_version="v202001201231"
docker_worker_version="v201912041449"

I built boot disk images using these four versions. I ran hg pulls with these four for the last ~6h without any hangs. I'm looking at the docker images now to see if maybe the container being used at that time had something different.

(In reply to Dave House [:dhouse] from comment #78)

(In reply to Dave House [:dhouse] from comment #77)

2/4/20
9:41 AM
Completed: Create image
monopacker-ci@taskcluster-imaging.iam.gserviceaccount.com created docker-worker-gcp-googlecompute-2020-02-04t16-30-23z

Miles, can I just check out the commit from that time period (fc00dd508b) in monopacker and build the docker-worker image to get it into gcp?

Thx Miles (we met and reviewed the monopacker build and the problem image)

Flags: needinfo?(miles)

Miles suggested the docker version or packages may have changed.

We have the image created Feb13, which has been in use and on which I could not reproduce the problem. On that, we have docker 19.03.6, build 369ce74a3c.
Docker 19.03.6 was released Feb12: https://docs.docker.com/engine/release-notes/#19036
Previous version 19.03.5 was released Nov14: https://docs.docker.com/engine/release-notes/#19035

containerd.io/now 1.2.10-3 amd64 [installed,local]
docker-ce/now 5:19.03.6~3-0~ubuntu-bionic amd64 [installed,local]
docker-ce-cli/now 5:19.03.6~3-0~ubuntu-bionic amd64 [installed,local]

So the Feb04 image would have had the older docker version. I'll try this previous version.

I've tested with docker 5:19.03.5 and a container from a feb-11 failed task (Mp3PpbPFTKW-rkdiT2x1qA https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=288443741&repo=autoland&lineNumber=32) .
No slow/fail/hung clones yet across the three docker-worker versions around the failure time (a few zeros from when I started the tests incorrectly)
Images for test:

docker-worker-gcp-googlecompute-v202002031613-docker19-03-5 	us 	1.27 GB 	10 GB 	taskcluster-imaging 		Mar 24, 2020, 2:45:27 PM 	
docker-worker-gcp-googlecompute-v202002031039-docker19-03-5 	us 	1.27 GB 	10 GB 	taskcluster-imaging 		Mar 24, 2020, 2:35:14 PM 	
docker-worker-gcp-googlecompute-v202001201231-docker19-03-5

GCP support concluded that they could not find more information about the deleted disk image or the instances that had the hg timeout/failure, and did not find anything that looked unusual or that might explain the problem. We closed the support case.

I completed more tests on the docker version from the failure time and have not found anything.

I am looking next at pushing a full set of builds through gcp workers to test this. The per-instance cost is estimated in the gcp console as:

That's about $0.319 hourly (pay for what you use: no upfront costs and per-second billing):

32 vCPUs + 72 GB memory:         $202.34/month
10 GB standard persistent disk:    $0.40/month
375 GB local SSD scratch disk:    $30.00/month
Sustained use discount:           -$0.00/month
Total:                           $232.74/month

During a regular week, our b-linux numbers peak around 750:
https://earthangel-b40313e5.influxcloud.net/d/MZihUsuWz/gcp-aws-builder-counts?orgId=1&from=now-7d&to=now

Maximum instance count for aws builders this week so far was 734, and 110 on gcp. Last week it was 832, and 103. And 6 weeks ago, Feb 16-22, it was 769. The maximum over the last 90 days was just under 1100 simultaneous builders.

Blocks: tc-gcp

On April 22/23, I ran a production-size load by copying production tasks, and there were no timeout problems (only failures similar to what was on aws at the time):

2402 total tasks ran:
2212 completed
190 failed (133 connection reset on hg clone/pull (similar to the timeouts seen in February, but caught immediately -- 70 of these were also aborted prematurely in aws), 2 worker container timed-out, 5 screen session fails, 32 TBPL failures, 8 build failures, 6 no logs)

Of the failures on gcp,
47 of those failed in aws
70 were aborted prematurely on aws

So, we saw 3% (73 of 2402 tasks) fail on gcp that did not fail on aws. All from a connection reset with the hg pull/clone.

Connor, do you know if we see more connection resets for bundle downloads in gcp than in aws? I attached the 10-line tail for each of the 133 failures with hg connection resets I saw in this set from April 23-24 (I'll collect more recent ones also, and look for the same failures after updating the worker image).

It looks like they were all against the bundle:
"applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/559b28bce1555d02dc1e19d63e4a64e4d54130ed.stream-v2.hg"

Flags: needinfo?(sheehan)

I'm wondering if, although google support did not find problems, we are exceeding the limits on the single-site hg bundles (in gcp's us-central1), re: bug 1587958. I did not count the number of simultaneous downloads at the failure times.

It looks like the connection resets on bundles are grouped by the time they occur.

I copied 100 tasks yesterday (from gecko-3/b-linux to gecko-3/b-linux-gcp) and 7 hit connection resets on downloading the bundle. 2 of those 7 saw a worker-shutdown interruption on aws.

All seven failures happened within about 20 seconds (the one in-progress download is logged as failed 30s later):
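A quick way to see that grouping across the collected task logs is to bucket the reset timestamps by minute (the logs/ directory is a placeholder; this assumes the "[vcs <timestamp>]" line prefix shown in the excerpts below):

$ grep -rh 'abort: Connection reset by peer' logs/ | cut -c6-21 | sort | uniq -c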

[vcs 2020-05-20T21:44:53.048Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:53.114Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:53.783Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:45:07.208Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:45:07.496Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:45:08.191Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:44:46.501Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:46.558Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:47.043Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:45:37.770Z] clone [=========================>                   ] 2072127653/3582756450 40s
[vcs 2020-05-20T21:45:37.770Z]                                                                                 
[vcs 2020-05-20T21:45:37.770Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:45:04.744Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:45:04.806Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:45:04.973Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:44:50.632Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:50.728Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:51.318Z] abort: Connection reset by peer
--
[vcs 2020-05-20T21:44:50.474Z] applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/ec7aebcb24cb7479ed9dddfa7f52abf38091660c.stream-v2.hg
[vcs 2020-05-20T21:44:50.533Z] 582964 files to transfer, 3.34 GB of data
[vcs 2020-05-20T21:44:50.711Z] abort: Connection reset by peer

(In reply to Dave House [:dhouse] from comment #91)

I'm wondering if, although google support did not find problems, we are exceeding the limits on the single-site hg bundles (in gcp's us-central1)

So I just took a look at the GCS quotas and limits page and as far as I can tell, there is no limit on the number of concurrent reads for an object: "There is no limit to reads of an object. Buckets initially support roughly 5000 object reads per second and then scale as needed."

It's not a surprise that we see all connections happening against the same bundle. Bundles are generated nightly and come in multiple formats, one of which is the "stream" bundle, which is the fastest to download on high-speed networks. So we would expect the same GCS-hosted stream bundle to be downloaded by most clients in GCP-CI on a given day.

To answer your question more directly: I'm not sure if we see more connection drops on GCP than AWS.

Flags: needinfo?(sheehan)

(In reply to Connor Sheehan [:sheehan] from comment #93)

(In reply to Dave House [:dhouse] from comment #91)

I'm wondering if, although google support did not find problems, we are exceeding the limits on the single-site hg bundles (in gcp's us-central1)

So I just took a look at the GCS quotas and limits page and as far as I can tell, there is no limit on the number of concurrent reads for an object: "There is no limit to reads of an object. Buckets initially support roughly 5000 object reads per second and then scale as needed."

It's not a surprise that we see all connections happening against the same bundle. Bundles are generated nightly and come in multiple formats, one of which is the "stream" bundle, which is the fastest to download on high-speed networks. So we would expect the same GCS-hosted stream bundle to be downloaded by most clients in GCP-CI on a given day.

To answer your question more directly: I'm not sure if we see more connection drops on GCP than AWS.

Thanks!

re: hg issues with shoddy connections: is there a retry adjustment we could make on the clone to work around a 30-60 second downtime for the bundles in gcp? I'm assuming that the grouped failures I saw last night are a pattern for the issue. If there isn't a config for hg, then I may look at just retrying the task's clone step on a connection failure.

Flags: needinfo?(sheehan)

I don't think there's a config for hg that specifically targets the clonebundle download step.

One of the things robustcheckout does is add retry logic around various network operations. We can patch it to retry when performing the bundle download (either in this bug or a new one). Having TC retry when a bundle download fails could work as well.
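Either way, the shape of the workaround is a bounded retry around the checkout step, roughly like this (the attempt count and sleep are arbitrary; the real change would live in robustcheckout or run-task rather than in the task script, and the elided robustcheckout arguments are whatever the task normally passes):

for attempt in 1 2 3; do
  hg robustcheckout ... && break
  echo "checkout attempt $attempt failed; retrying in 30s"
  sleep 30
done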

Flags: needinfo?(sheehan)
See Also: → 1653171

Are we still debugging this? If not, it may be time to shut these off entirely for now.

If switching to gcp becomes something we need to look at again this year, then I think we need to debug the clonebundle timeout.
I would like to keep the mirror builds going in gcp infrequently so that we know we could switch over. However, I could be wrong that that provides any value.
What do you think? Or, who is deciding when the builds move to gcp?

Flags: needinfo?(bhearsum)

It's worth noting that the GCP builds are still slower than those on AWS, both because the machines are slower despite having more cores, and because the backing storage (equivalent to S3) for sccache is slower too.

(In reply to Dave House [:dhouse] from comment #97)

If switching to gcp becomes something we need to look at again this year, then I think we need to debug the clonebundle timeout.
I would like to keep the mirror builds going in gcp in-frequently so that we know we could switch over. However, I could be wrong that that provides any value.
What do you think? Or, who is deciding when the builds move to gcp?

After digging into cost ($200-$300/month), I'm fine to leave them on. I don't know who is making the decision on whether or not to move, though.

Flags: needinfo?(bhearsum)

Builds have been running in GCP for a while now, and shippable builds just switched over recently. I think we should close this and open a new bug if we want to prioritize improving performance again.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME