Closed Bug 1520588 Opened 5 years ago Closed 5 years ago

Deploy shipitscript into GCP

Categories

(Release Engineering :: General, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: rail)

References

Details

(Whiteboard: [releng:q12019])

No description provided.

:oremj would it be possible to create web-less instances (just like we did for the shipit worker) for a new project, scriptworker/shipit?

As usual we would need this for 3 environments:

  • testing (docker tag: scriptworker/shipit-docker-testing-latest)
  • staging (docker tag: scriptworker/shipit-docker-staging-latest)
  • production (docker tag: scriptworker/shipit-docker-production-latest)

We will also need network access to shipit.

Flags: needinfo?(oremj)

(In reply to Rok Garbas [:garbas] from comment #1)

  • testing (docker tag: scriptworker/shipit-docker-testing-latest)
  • staging (docker tag: scriptworker/shipit-docker-staging-latest)
  • production (docker tag: scriptworker/shipit-docker-production-latest)

I think we need to fix these tags; slashes are not allowed in Docker tags (only in repository names).

:autrilla

new tags are:

For now we can skip production until we figure out how to handle secrets.

Flags: needinfo?(autrilla)

I updated the images and they are ready to go (tested on my laptop, TM). I have the secrets, just need to hand them over.

We also need to figure out how to properly configure the network, so we can connect to both ship-it APIs (until we retire v1).

  • ship-it v1 is hosted by IT and requires a VPN connection (the vpn_shipit LDAP group for prod and vpn_shipitdev for dev, I believe). ericz may have better info.

  • ship-it v2 is hosted by cloudops and restricted by IP; vpn_cloudops_shipit is the LDAP group we use to add users. I'm not sure if the LDAP group is relevant in this case.

(In reply to rail@mozilla.com from comment #4)

I updated the images and they are ready to go (tested on my laptop, TM). I have the secrets, just need to hand them over.

Great! Is there any difference in how the image should be run on each environment, other than the secrets?

We also need to figure out how to properly configure the network, so we can connect to both ship-it APIs (until we retire v1).

  • ship-it v1 is hosted by IT and requires a VPN connection (the vpn_shipit LDAP group for prod and vpn_shipitdev for dev, I believe). ericz may have better info.

This might be a bit problematic, I thought we only needed to talk to v2. I imagine IT would want us to have a single static IP from which we talk to ship-it, is that so :ericz?

I haven't done anything like this before on GCP, but someone from my team has, and AIUI it was for applications we control, not for something run by IT.

  • ship-it v2 is hosted by cloudops and restricted by IP; vpn_cloudops_shipit is the LDAP group we use to add users. I'm not sure if the LDAP group is relevant in this case.

Talking to ship-it v2 won't be an issue since they're both in the same cluster and we won't need to cross into the internet.

Flags: needinfo?(autrilla) → needinfo?(eziegenhorn)

(In reply to Adrian Utrilla [:autrilla] from comment #5)

Great! Is there any difference in how the image should be run on each environment, other than the secrets?

The command line is the same (the default CMD directive). They are configured to use different configs depending on the env/secrets.
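As a hedged sketch (the real shipitscript config layout may differ; the variable and path names here are illustrative, not the actual ones), the env-driven selection amounts to something like:

# Illustrative only: pick a config file based on the deploy environment.
# The actual shipitscript variable names and paths may differ.
import os

env = os.environ["ENV"]  # "testing", "staging", or "production"
config_path = f"/app/configs/config.{env}.json"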

To talk to ship-it v1 I think we'd have to set it up on a public-facing load balancer with a different DNS name and then we could potentially limit it by IP address (or maybe something else but offhand I can't think of anything better).

Flags: needinfo?(eziegenhorn)

:ericz, all traffic from our nonprod (staging and testing) environments will come from 35.197.23.59. Could you whitelist this IP so we can talk to ship-it v1 from it? Let me know if you need any more information to do this.

Flags: needinfo?(eziegenhorn)
See Also: → 1525746

I'm spinning off that work in new bug 1525746.

Flags: needinfo?(eziegenhorn)

Sigh, the name won't resolve, because we use split-horizon DNS.

https://tools.taskcluster.net/groups/VCn9f5PISjORFOZu84jhqA/tasks/arEl7VusQymG7S3Tt0yJDA/runs/0/logs/public%2Flogs%2Flive_backing.log

2019-02-15 03:04:59,899 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ship-it-dev.allizom.org:443
Traceback (most recent call last):
  File "/nix/store/sfx431rh4x09nv0sgripmn01rf6pwdb6-python3.7-urllib3-1.24.1/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/nix/store/sfx431rh4x09nv0sgripmn01rf6pwdb6-python3.7-urllib3-1.24.1/lib/python3.7/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/nix/store/sh0rq55jaambzqx59g0kdk59g23vj8m6-python3-3.7.0/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The good thing is that the worker takes tasks from the queue. \o/

We may want to tweak the worker name a bit: it uses the hostname (which is not unique) and equals the first 22 characters of the k8s workload name (scriptworker-stage-shipitapi-app-1 -> scriptworker-stage-shi). I'm not even sure the name would be useful in any way...
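For illustration, the truncation is literally the first 22 characters of the workload name:

>>> "scriptworker-stage-shipitapi-app-1"[:22]
'scriptworker-stage-shi'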

Adrian, do we autodeploy the docker images to stage/testing now? I tried to test a workaround, but it looks like scriptworker-stage-shipitapi-app-1 is still using the images from Feb 6.

Flags: needinfo?(autrilla)

We did not when you commented, but we do now. There's an up-to-date image in staging now.

Flags: needinfo?(autrilla)

Thank you!

Blocks: 1533337
See Also: → 1536853

We finally dropped ship-it v1 and don't need any special routes to MDC1/2. We can undo the special settings.

Now I'm getting a 403 from https://api.shipit.staging.mozilla-releng.net/ when I try to run shipitscript. The idea was that they are in the same cluster, so the IP-based whitelisting would work out of the box, without any extra setup.

Adrian, can you

  1. get rid of the customization made in comment #8; we no longer need to communicate with ship-it v1. No rush on this.

  2. make sure that scriptworker-stage-shipitapi-app-1 is whitelisted in either shipitapi-dev-shipitapi-app-1 or shipitapi-stage-shipitapi-app-1 (I always forget which one corresponds to our staging :/). Maybe it'll resolve itself once you get rid of 1)

Probably it'd be better to align the names at some point, to get rid of this dev/stage/staging confusion with shipit.

Thank you in advance!

Flags: needinfo?(autrilla)

Regarding the 403, it's because you're trying to talk to it through the public IP. You should be able to connect to the Kubernetes service directly over HTTP (not HTTPS, since we terminate that at the edge).

In stage.shipitapi.nonprod.cloudops.mozgcp.net, this is http://shipitapi-stage-shipitapi-app-1.

In testing.shipitapi.nonprod.cloudops.mozgcp.net, this is http://shipitapi-testing-shipitapi-app-1.
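For example, something like this should work from the worker pod (a hedged sketch; it assumes the pod can resolve the in-cluster service name):

# Minimal in-cluster connectivity check; plain HTTP, since TLS
# terminates at the edge.
import requests

print(requests.get("http://shipitapi-stage-shipitapi-app-1/").status_code)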

D'oh... We enforce HTTPS in our apps, so I get a 302 to the HTTPS URL:

2019-04-23T14:56:33 INFO - 2019-04-23 14:56:33,862 - urllib3.connectionpool - DEBUG - http://shipitapi-stage-shipitapi-app-1:80 "PATCH /releases/Fennec-67.0b3-build1 HTTP/1.1" 302 345
2019-04-23T14:56:33 INFO - 2019-04-23 14:56:33,864 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): shipitapi-stage-shipitapi-app-1:443

Need to think about what to do...

This could be our NGINX redirecting you. If you send an X-Forwarded-Proto header set to https, that should prevent NGINX from redirecting. Not sure if that's doable. Otherwise we could expose the application directly instead of through NGINX through a Kubernetes Service, but that's not ideal.
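Something like this untested sketch is what I mean (service name from my previous comment, request path from your log; whether the header is honored is an assumption):

# Untested assumption: claiming the request already arrived over HTTPS
# should make the proxy/app skip the 302 redirect.
import requests

resp = requests.patch(
    "http://shipitapi-stage-shipitapi-app-1/releases/Fennec-67.0b3-build1",
    headers={"X-Forwarded-Proto": "https"},
)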

Yeah, it's getting a bit hairier than I thought. :)

Rok, maybe you have some ideas?

There are a couple of issues:

  1. When I use the FQDN to access the API endpoint, the requests end up hitting the public IP, which requires whitelisting the k8s replicas and defeats the idea that we should bypass public routes within the same cluster.

I wonder if the source IPs of requests coming from the same cluster are the same as the public IP of that cluster, so we could easily whitelist it.

  2. If I use the k8s names (e.g. shipitapi-stage-shipitapi-app-1), then I have to use http instead of https, but flask-talisman redirects to https in our case, and then the request times out.

I can hack the client requests and set the X-Forwarded-Proto header to "https". In this case we bypass flask-talisman, but then I hit an issue with mohawk, which verifies the auth headers but for some reason falls back to using port 443 instead of 80. It probably fails to properly guess the port in https://github.com/mozilla/release-services/blob/30fe29c037cb2a58d64ebdbf6dcf5b1456e14820/lib/backend_common/backend_common/auth.py#L391.
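Roughly, the hack would look like the sketch below (the credentials and payload are made up for illustration; the point is only the sign-against-443 trick):

# Sketch of the client-side hack: sign the Hawk header against the
# https/443 form of the URL that the server-side verification seems to
# reconstruct, while the request itself goes over plain HTTP.
from mohawk import Sender

credentials = {"id": "shipit", "key": "<secret>", "algorithm": "sha256"}  # illustrative
sender = Sender(
    credentials,
    "https://shipitapi-stage-shipitapi-app-1:443/releases/Fennec-67.0b3-build1",
    "PATCH",
    content=b'{"status": "shipped"}',  # illustrative payload
    content_type="application/json",
)
headers = {
    "Authorization": sender.request_header,
    "X-Forwarded-Proto": "https",
    "Content-Type": "application/json",
}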

TBH, 2) sounds a bit dirty and hacky. :/

Any other alternatives?

Flags: needinfo?(rgarbas)

We chatted about this today with Rok, and I think I'm going to take the second route. It won't require any special changes in either ship-it or GCP/k8s. This way we don't rely on a special setup, only on the client.

Flags: needinfo?(rgarbas)
Flags: needinfo?(autrilla)
Depends on: 1547317

Looks like we are ready to go with prod in bug 1547317. Let's do eet! :)

Found in triaging. We moved the shipitscript workers into GCP a while ago in bug 1581149.
I think we can close this for now. Feel free to re-open if I'm wrong.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED