Closed Bug 1585133 Opened 3 years ago Closed 3 years ago

Optimize hgmo for GCP

Categories

(Developer Services :: Mercurial: hg.mozilla.org, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sheehan, Assigned: sheehan)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

We need to do some work to port various AWS optimizations for hg.mo over to GCP. Namely:

  • Uploading bundles to the GCP equivalent of AWS S3.
  • Determining if the origin IP address for a request comes from a GCP-advertised IP block, i.e. the GCP equivalent of AWS's IP ranges document, which appears to be this process.
  • Prioritizing stream clone bundles from the same GCP region to requests coming from these IP blocks.

(In reply to Connor Sheehan [:sheehan] from comment #0)

We need to do some work to port various AWS optimizations for hg.mo over to GCP. Namely:

  • Uploading bundles to the GCP equivalent of AWS S3.
  • Determining if the origin IP address for a request comes from a GCP-advertised IP block, i.e. the GCP equivalent of AWS's IP ranges document, which appears to be this process.
  • Prioritizing stream clone bundles from the same GCP region to requests coming from these IP blocks.

Connor: can you estimate how much work would be required for this? Are we talking days/weeks/months?

Flags: needinfo?(sheehan)

It should be a few days' work - I've already made considerable progress on it. I'm away starting tomorrow afternoon, returning next Thursday (Oct 3-9, back the 10th). I'll do my best to have it deployed shortly after I return.

Flags: needinfo?(sheehan)

(In reply to Connor Sheehan [:sheehan] from comment #2)

It should be a few days' work - I've already made considerable progress on it. I'm away starting tomorrow afternoon, returning next Thursday (Oct 3-9, back the 10th). I'll do my best to have it deployed shortly after I return.

Connor: checking in now that you're back, do you still think you'll be able to tackle this ASAP? We're definitely still hitting hg bottlenecks in GCP.

Flags: needinfo?(sheehan)

(In reply to Chris Cooper [:coop] pronoun: he from comment #3)

Connor: checking in now that you're back, do you still think you'll be able to tackle this ASAP? We're definitely still hitting hg bottlenecks in GCP.

Yes, I'm still making progress here. The remaining work is to determine what GCS bucket storage class we should be using, then create the bundles and point them at the new buckets. If we're experimenting in a single GCE region we can simply create a bucket there and serve bundles from that region for all incoming GCP requests. If we're already in multiple regions, we may need to do some more work; I will need to speak with someone from CloudOps to determine the best path forward, due to limitations in GCP APIs.

Which GCE regions are we running the builds out of?

Flags: needinfo?(sheehan) → needinfo?(coop)

I spoke with Brian and apparently we are in us-central1 only, for the time being. This will allow me to complete this optimization fairly easily, after which I'm going to get started on the work to stand up private hgweb mirrors for GCP. Having the mirrors stood up will allow me to work around the aforementioned API limitations in GCP.

Flags: needinfo?(coop)

This commit adds a new subcommand to scrape-manifest-ip-ranges.py,
which scrapes Google's DNS records to gather information about its
public IP address blocks. The process implemented in this commit is
outlined in Google's cloud support docs. [1]

To summarize, we use the dnspython DNS toolkit to first run a
TXT query for _cloud-netblocks.googleusercontent.com. This query
returns a list of further domains, each of which resolves to a set
of IP blocks for Google Cloud Platform services. The resulting blocks are then
saved to a file on disk. An example output looks like:

ip4:35.199.0.0/17
ip4:35.199.128.0/18
ip4:35.235.216.0/21
ip6:2600:1900::/35
ip4:35.190.224.0/20

[1] https://cloud.google.com/compute/docs/faq#find_ip_range
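The walk over those TXT records can be sketched as follows. This is an illustrative outline of the process described above, not the actual subcommand: the function names (`parse_spf_record`, `scrape_gcp_blocks`) and the `resolve_txt` callback are hypothetical.

```python
def parse_spf_record(txt):
    """Split an SPF-style TXT record into include: domains and ip blocks."""
    includes, blocks = [], []
    for token in txt.split():
        if token.startswith("include:"):
            includes.append(token[len("include:"):])
        elif token.startswith(("ip4:", "ip6:")):
            blocks.append(token)
    return includes, blocks

def scrape_gcp_blocks(resolve_txt):
    """Walk the netblock records; `resolve_txt` maps a DNS name to its TXT data."""
    seen = set()
    pending = ["_cloud-netblocks.googleusercontent.com"]
    blocks = []
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        includes, found = parse_spf_record(resolve_txt(name))
        pending.extend(includes)
        blocks.extend(found)
    return blocks
```

In the real script, `resolve_txt` would be backed by dnspython (roughly, joining the strings of each rdata from `dns.resolver.resolve(name, "TXT")`), and the resulting blocks written to the output file shown above.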

This commit adds a systemd unit and timer to schedule runs of the
GCP address scraper on hg-web. The unit and timer are copies of
the AWS scraper's unit/timer, except the gcp subcommand of
the manifest scraper script is called instead.
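A unit/timer pair along these lines is what the commit describes; the file names, paths, and schedule below are hypothetical placeholders, not the actual contents of the Ansible role.

```ini
# scrape-gcp-ip-ranges.service (illustrative)
[Unit]
Description=Scrape GCP advertised IP ranges

[Service]
Type=oneshot
ExecStart=/var/hg/venv_tools3/bin/python scrape-manifest-ip-ranges.py gcp

# scrape-gcp-ip-ranges.timer (illustrative)
[Unit]
Description=Periodically scrape GCP IP ranges

[Timer]
OnCalendar=hourly

[Install]
WantedBy=timers.target
```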

We will need this dependency for an upcoming commit which adds code
to query DNS records.

The bundle generation script will soon need to upload files to Google
Cloud Storage. This commit updates requirements-bundles.txt to add
the required SDK dependencies.

This commit adds Terraform configs for a GCS bucket and service account
required to publish Mercurial clonebundles to GCP. The service account
represents the hgssh master server process which generates the bundle
and uploads to GCP, with the corresponding key being used for credentials.
The bucket is created with a retention policy and lifecycle policy of
7 days. The retention policy holds data as undeletable for a minimum of
7 days and the lifecycle policy deletes the data after it is 7 days
past expiration time.
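A minimal sketch of what such a Terraform config could look like, assuming the google provider's `google_storage_bucket` and `google_service_account` resources; the bucket name matches the one later visible in the clone logs, but the account id and exact policy wiring are assumptions.

```hcl
resource "google_storage_bucket" "clonebundles" {
  name     = "moz-hg-bundles-gcp-us-central1"
  location = "US-CENTRAL1"

  # Hold objects as undeletable for a minimum of 7 days.
  retention_policy {
    retention_period = 604800 # 7 days, in seconds
  }

  # Delete objects once they are past the retention window.
  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 7 # days
    }
  }
}

resource "google_service_account" "hgbundler" {
  account_id   = "hgbundler"
  display_name = "hgssh clonebundle uploader"
}
```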

This commit extends the clonebundle generation and upload script to
also upload generated bundles to a GCS bucket in us-central1. The
format from the S3 bundle upload was mostly replicated and GCS APIs
were substituted for the S3 APIs. Most region-specific operations are
left in loops to facilitate easily extending into more GCS regions.

test-clonebundles.t was updated to reflect new clonebundles manifest
entries and the bundleclone.rst documentation includes the new
gceregion bundle attribute.
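The upload-and-manifest step can be outlined as below, assuming the google-cloud-storage SDK added in the earlier commit. The function names, the `BUNDLESPEC` attribute layout, and the exact bucket naming scheme are illustrative; the real script derives these from its configuration.

```python
GCS_REGIONS = ("us-central1",)  # kept in a loop to ease adding regions later

def upload_bundle(repo, bundle_path, remote_name):
    """Upload one generated bundle to the per-region GCS buckets."""
    # Imported lazily so the module loads without the SDK installed.
    from google.cloud import storage

    client = storage.Client()
    for region in GCS_REGIONS:
        bucket = client.bucket("moz-hg-bundles-gcp-%s" % region)
        blob = bucket.blob("%s/%s" % (repo, remote_name))
        blob.upload_from_filename(bundle_path)

def manifest_entry(repo, remote_name, region, bundlespec):
    """Build a clonebundles manifest line carrying the new gceregion attribute."""
    url = "https://storage.googleapis.com/moz-hg-bundles-gcp-%s/%s/%s" % (
        region, repo, remote_name)
    return "%s BUNDLESPEC=%s gceregion=%s" % (url, bundlespec, region)
```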

This commit teaches the hgmo extension to prioritize stream clone bundles
when responding to clone requests from IP addresses in GCP. To do so we
make the filter_manifest_for_aws_region function more generic to account for the
new GCP regions. We add a new config option which points to a path on disk
where the previously added GCP IP scraper will dump a file containing IP
addresses for known GCP blocks. This file is mocked out by adding an
example file to the docker-hg-web Ansible role.
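The prioritization logic amounts to something like the following sketch, using the stdlib ipaddress module. The helper names and the dict-based manifest-entry shape are hypothetical; only the scraper file format (`ip4:`/`ip6:` CIDR lines) and the `gceregion` attribute come from the commits above.

```python
import ipaddress

def load_gcp_networks(lines):
    """Parse scraper output lines like 'ip4:35.199.0.0/17' into network objects."""
    nets = []
    for line in lines:
        line = line.strip()
        if line.startswith(("ip4:", "ip6:")):
            # Split on the first colon only, so IPv6 addresses stay intact.
            nets.append(ipaddress.ip_network(line.split(":", 1)[1]))
    return nets

def filter_manifest_for_gce_region(entries, origin_ip, networks, region):
    """Move stream bundles for the matching GCE region to the front of the manifest."""
    addr = ipaddress.ip_address(origin_ip)
    if not any(addr in net for net in networks):
        return entries  # not a GCP client; leave ordering alone
    preferred = [e for e in entries if e.get("gceregion") == region]
    rest = [e for e in entries if e.get("gceregion") != region]
    return preferred + rest
```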

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/4cbcfeb98791
ansible/hg-web: add dnspython to venv_tools3 on hgweb r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/833e7e7f3c2f
scripts: add gcp option to scrape-manifest-ip-ranges.py r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/b86cd8ce560a
ansible/hg-web: add systemd unit and timer for GCP IP address scrape r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/7bf7c9fbab4f
ansible/hg-ssh: add google-cloud-storage dependency to requirements-bundles.txt r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/dfc58dbfcca3
terraform: create resources to store Mercurial clonebundles in GCP r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/9e2203249fdb
hgserver: extend bundle generation script to upload to GCS r=smacleod
https://hg.mozilla.org/hgcustom/version-control-tools/rev/6c25d57a7552
hgmo: prioritize stream clone bundles when cloning from GCP r=smacleod

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

This has landed but needs to be deployed and tested in production to assert the GCP upload/download works as intended. I'll be taking care of that tomorrow morning.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/9d5409716a91
terraform: switch from bucket ACL to IAM member policy
https://hg.mozilla.org/hgcustom/version-control-tools/rev/b7590e298924
terraform: grant admin bucket privileges for hgbundler service account
https://hg.mozilla.org/hgcustom/version-control-tools/rev/48bc0a9aa838
bundles: fix busted import of Google cloud SDK
https://hg.mozilla.org/hgcustom/version-control-tools/rev/9ecc4d94fa5b
ansible/hg-ssh: specify path to hgbundler credentials file

Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Need to deploy and test one last piece.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

This is deployed. Now when running clone tasks from within GCP the initial download of the bundle will come from GCS. It will also be a stream-clone bundle, which is better on fast networks.

Full download and working directory checkout:

cosheehan@instance-test-google-bundles:~$ time /home/cosheehan/.local/bin/hg clone https://hg.mozilla.org/mozilla-unified
destination directory: mozilla-unified
applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/mozilla-unified/fa97283e9f5d89b55d24eeb4171036bd34d12f00.packed1.hg
545204 files to transfer, 3.08 GB of data
transferred 3.08 GB in 62.0 seconds (50.8 MB/sec)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 389 changesets with 6891 changes to 6244 files (+2 heads)
new changesets 5d748daa45d3:8a47372311a9
updating to branch default
(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")
282800 files updated, 0 files merged, 0 files removed, 0 files unresolved
real    3m45.904s
user    3m33.910s
sys     0m54.175s

Working directory checkout with a cached repo:

cosheehan@instance-test-google-bundles:~$ time /home/cosheehan/.local/bin/hg share mozilla-unified/ share-unified
updating working directory
(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")
282800 files updated, 0 files merged, 0 files removed, 0 files unresolved
real    2m4.927s
user    2m33.870s
sys     0m36.591s

Tested on an n2-standard-2 (2 vCPUs, 8 GB memory), premium network tier, standard persistent disk. Checkout performance might be better on the build instances (as is the case here, where it only takes ~45s); this test was mostly to make sure the new code was functioning correctly in production. But the initial clone, which previously took 18-20m, should now be as fast as Google's networks will allow us to transfer bits.

Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Thanks, Connor.

I kicked off a Try run to test this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=20c99f633f3c035d2f2f66d017f97f9677ed7201

If there's an existing clone on a worker, am I going to see any improvement, or any evidence otherwise in the log?

Builds are still completing, but if I look at the linux64 opt plain build that I normally use as a metric, I can't tell from the log whether anything has changed. Granted, it's an existing clone, but should I be worried about "region gecko-1 not yet supported"?

[vcs 2019-10-25T00:14:52.667Z] fetching hgmointernal config from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgmointernal
[vcs 2019-10-25T00:14:53.057Z] region gecko-1 not yet supported; using public hg.mozilla.org service
[vcs 2019-10-25T00:14:53.057Z] fetching hg.mozilla.org fingerprint from http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint
[vcs 2019-10-25T00:14:53.184Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--config', 'hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '20c99f633f3c035d2f2f66d017f97f9677ed7201', 'https://hg.mozilla.org/try', '/builds/worker/workspace/build/src']
[vcs 2019-10-25T00:14:53.282Z] (using Mercurial 4.8.1)
[vcs 2019-10-25T00:14:53.282Z] ensuring https://hg.mozilla.org/try@20c99f633f3c035d2f2f66d017f97f9677ed7201 is available at /builds/worker/workspace/build/src
[vcs 2019-10-25T00:14:53.862Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2019-10-25T00:14:54.136Z] (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
Flags: needinfo?(sheehan)

No, this won't make a difference if there's an existing clone on the worker. In that case we would still need to hg pull the new changes from the public hgweb endpoint in MDC1 (looks to take about 10s from the log in comment 18), then perform a working directory checkout on the worker, which takes about 45s.

The line about "region gecko-1 not yet supported" relates to private mirrors. run-task fetches a Taskcluster secret and checks if the value of the TASKCLUSTER_WORKER_GROUP environment variable is a key in the secret. If the key exists, the worker group is supported for private hgweb mirrors, and the value mapped to the key contains configuration for communicating with the private mirror. Since we don't have mirrors for GCP yet, it's expected that we see that line in the logs.

In tasks where this change will make a difference, after "cloning from upstream repo", we won't see "sharing from existing pooled repository". Instead we'll see the output from comment 17, "applying clone bundle from https://storage.googleapis.com/moz-hg-bundles-gcp-us-central1/<repo>/<revision>.packed1.hg".

Flags: needinfo?(sheehan)