Closed Bug 1843356 Opened 1 year ago Closed 1 year ago

remove support for private symbols bucket in tecken

Categories

(Tecken :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Attachments

(4 files)

Per bug #1838892, we're dropping support for the private symbols bucket.

This bug covers:

  1. demote the accounts that were noted in UPLOAD_URL_EXCEPTIONS so they no longer have upload privileges
  2. remove the UPLOAD_URL_EXCEPTIONS configuration variable and everything that uses it in the tecken codebase
  3. remove the access=public/private stuff from SYMBOL_URLS handling
  4. remove the UPLOAD_URL_EXCEPTIONS configuration in the infrastructure codebase
  5. write up a DSRE bug to remove the private symbols bucket for stage and prod

Grabbing this to work on now. I removed uploader privileges for anyone listed in the UPLOAD_URL_EXCEPTIONS.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

The public/private setting affects how the StorageBucket works. Switching everything to public (aka getting rid of private) causes a large number of tests to fail: they all depended on the bucket being private, which forces the boto codepath for download, and that codepath is mocked out in the tests.

To remove the public/private stuff, I need to rewrite the tests so they use localstack (a fake S3 service) for upload and HTTP for download instead of botomock for both (bug #1834626). I wanted to do that anyway, though not necessarily in this pass, but the two things are tangled, so I'm just going to do it all.

This needs extensive testing on stage.

Before landing this, I ran the system tests against stage to get a baseline for what the output looks like. Everything looks ok.

Once this deploys to stage, I will:

  1. re-upload those same zip files to make sure duplicate-identification works
  2. run the system tests with a new set of zip files to verify everything (upload and download) works
  3. go through the site to make sure all the pages work

I ran the system tests on stage before and after. All the tests did fine with the uploading API. However, the downloading API is failing.

https://mozilla.sentry.io/issues/4337235850/?query=is%3Aunresolved&referrer=issue-stream&stream_index=0

On stage, the first url in SYMBOL_URLS is:

https://s3-us-east-1.amazonaws.com/symbols-webeng-stage

If I try to HEAD a symbol file in the Tecken symbols stage public bucket, I get a dns error:

curl --head 'https://s3-us-east-1.amazonaws.com/symbols-webeng-stage/v1/libfreebl3.dylib/D260D4EA1F983017A5C99E944CF653B30/libfreebl3.dylib.sym'
curl: (6) Could not resolve host: s3-us-east-1.amazonaws.com

Tecken prod has this as the first url in SYMBOL_URLS:

https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public

If I do the same thing to the Tecken symbols prod public bucket, it works fine:

curl --head 'https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/libfreebl3.dylib/D260D4EA1F983017A5C99E944CF653B30/libfreebl3.dylib.sym'
HTTP/1.1 200 OK
x-amz-id-2: og8g+iJaAQSz599xyOItWA3xrR2ddTtYW7eUk6iJbFbBmbcuZhXF617Wg+lbwMrwz0IkJwmaWKI=
x-amz-request-id: 7MNECFR7YRHP8QJ7
Date: Mon, 31 Jul 2023 14:19:47 GMT
Last-Modified: Thu, 27 Jul 2023 17:18:54 GMT
x-amz-expiration: expiry-date="Sun, 27 Jul 2025 00:00:00 GMT", rule-id="first_ia_then_delete"
ETag: "12a8670fac7df82bc5b80bf202b6a8e6"
x-amz-server-side-encryption: AES256
x-amz-meta-original_md5_hash: b6e10cca0af2c864f586302446aa82f6
x-amz-meta-original_size: 771035
Content-Encoding: gzip
x-amz-version-id: ar1QIPcdxx34ARB85_pqhtx6a_ro0Caq
Accept-Ranges: bytes
Content-Type: text/plain
Server: AmazonS3
Content-Length: 201727

I think one of a couple of things is wrong with Tecken stage:

  1. the SYMBOL_URLS value is wrong; wrong region? wrong structure?
  2. the symbols-webeng-stage bucket policy differs from the one for org.mozilla.crash-stats.symbols-public and doesn't allow HTTPS access, though I don't know why this would cause a DNS error

My hunch is that the SYMBOL_URLS value is wrong and has been ever since the stage environment was set up years ago. Because the url didn't have access=public, the bucket was checked using boto, which ignored the host in the url, so it worked fine. Now that that code is out (no more public/private), it's using HTTP HEAD, which uses the host in the url and fails.
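A minimal sketch of that difference, not Tecken's actual code: the two helper names below are hypothetical, and the assumption is that the bucket name is the first path segment of the SYMBOL_URLS entry. It shows why a boto-style lookup never touches the host while an HTTP HEAD depends on it.

```python
from urllib.parse import urlparse


def boto_head_args(symbol_url, key):
    """Arguments a boto-style HeadObject call would need.

    Only the bucket name (first path segment) and the key are used;
    the host in symbol_url is never contacted.
    """
    parsed = urlparse(symbol_url)
    bucket, _, prefix = parsed.path.lstrip("/").partition("/")
    full_key = f"{prefix.rstrip('/')}/{key}" if prefix else key
    return {"Bucket": bucket, "Key": full_key}


def http_head_url(symbol_url, key):
    """URL a plain HTTP HEAD request would hit.

    The host from symbol_url is used verbatim, so a host that does
    not resolve fails at the DNS step.
    """
    return f"{symbol_url.rstrip('/')}/{key}"


stage = "https://s3-us-east-1.amazonaws.com/symbols-webeng-stage"
key = "v1/libfreebl3.dylib/D260D4EA1F983017A5C99E944CF653B30/libfreebl3.dylib.sym"

print(boto_head_args(stage, key))  # host is dropped entirely
print(http_head_url(stage, key))   # host is part of the request
```

With the stage value, the first call only carries the bucket and key, which would explain why the bad host went unnoticed for so long.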

I wrote up a Jira ticket to get some SRE help to look into it.

There were two issues:

  1. We were using a legacy endpoint of the form s3-us-east-1..., with a hyphen between the service and region, and that doesn't work with us-east-1. We were advised to switch to s3.us-east-1....

    https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#s3-legacy-endpoints

  2. Once we fixed the endpoint url, then we hit an HTTP 403 when requesting things in the bucket. Harold added a bucket policy.
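For illustration, the endpoint change in the first issue can be sketched as a small URL rewrite. This is a hypothetical helper, not something in the Tecken or infrastructure codebases; it assumes path-style URLs and only touches hosts of the form s3-<region>.amazonaws.com.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Legacy path-style hosts such as "s3-us-east-1.amazonaws.com".
# The region pattern is deliberately narrow so that other legacy
# hosts (for example "s3-external-1") are left alone.
LEGACY_HOST = re.compile(r"^s3-([a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com$")


def modernize_s3_url(url):
    """Rewrite s3-<region>.amazonaws.com to s3.<region>.amazonaws.com."""
    parts = urlsplit(url)
    match = LEGACY_HOST.match(parts.netloc)
    if match is None:
        return url  # not a legacy regional host; return unchanged
    new_host = f"s3.{match.group(1)}.amazonaws.com"
    return urlunsplit(parts._replace(netloc=new_host))


print(modernize_s3_url("https://s3-us-east-1.amazonaws.com/symbols-webeng-stage"))
```

The helper would also rewrite the prod s3-us-west-2 host, which currently works in its hyphenated form; the dotted form is the one the AWS docs recommend for all regions.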

Now the download API works.

One interesting thing is that this means that the download API never worked on stage, but because the stage SYMBOL_URLS has the second url as prod, it would always degrade to redirecting to prod. Fun!
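A rough sketch of that degrade-to-prod behavior, assuming the download API simply walks SYMBOL_URLS in order and takes the first hit. The function name and the `exists` callable are hypothetical; in the real service the check would be the HTTP HEAD.

```python
def find_symbol(symbol_urls, key, exists):
    """Return the first full URL for which exists(url) is true, else None."""
    for base in symbol_urls:
        url = f"{base.rstrip('/')}/{key}"
        if exists(url):
            return url
    return None


STAGE = "https://s3-us-east-1.amazonaws.com/symbols-webeng-stage"
PROD = "https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public"


def stage_always_fails(url):
    # Simulates the broken stage host: every check against it fails,
    # while every check against prod succeeds.
    return not url.startswith(STAGE)


found = find_symbol([STAGE, PROD], "v1/foo.pdb/ABC/foo.sym", stage_always_fails)
print(found)  # the prod URL, because the stage check never succeeds
```

Because the fallback succeeds silently, a broken first entry looks like a working download API from the outside.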

I need to go through and test everything on stage. I'm also going to make some minor changes to the systemtest scripts.

I went through and tested everything on stage:

  1. system test against stage:
    • tests uploading symbol files using the upload API
    • tests uploading symbol files using the upload API and the upload-by-download path
    • tests the download API
  2. I built a new symbol zip file and uploaded it multiple times; second+ upload didn't reupload because the items were the same as what was in the bucket already
  3. I verified that the download API lets you download symbol files from the stage symbols bucket and can degrade to the prod symbols bucket when the requested file isn't in the stage bucket
  4. I went through the upload, upload files, file, and django admin site status pages to make sure they all worked correctly.

Everything looks good. I think this is ready to go to production now. Whew.

Once we push this to production, we still need to do the following cleanup tasks:

  1. remove the access=public/private stuff from SYMBOL_URLS handling
  2. remove the UPLOAD_URL_EXCEPTIONS configuration in the infrastructure codebase
  3. write up a DSRE bug to remove the private symbols bucket for stage and prod

This went to prod in bug #1847570.

There's an increase in mean time (ms) to handle a download API request going from 10ms to ~20ms.

It's possible this is because we just did a deploy and we've got a cold cache or something like that. It's also possible that this is because of a code change. Previously, the code did a HeadObject call using the boto library client. Now it does an HTTP HEAD call using requests. It's possible the former was faster--maybe it involved less set-up/take-down time. It's also possible both are true.

I checked the Socorro dashboard and there's no noticeable change in the stackwalker runtime.

I'll keep an eye on it, but even if it is a regression, I think it's ok for now since it's a ~10ms difference on something that's pretty fast. We can optimize it after we migrate to GCP and after we rewrite the storage layer, which will give us a lot more options for how to figure out if a file exists in a bucket.

Also, it looks like we stopped emitting some metrics:

  • tecken.symboldownloader_exists
  • tecken.symboldownloader_exists_cache_hit
  • tecken.symboldownloader_exists_cache_miss

Other than that, everything looks ok at this point.

I want to re-add those metrics. Also, since this is a large change I'll take a look at everything again tomorrow.

Re: the regression, the graphs settled back down at a mean of 10ms, so it was probably a cold cache or something like that.

I figured out what happened with the missing metrics. We had two different "does this thing exist?" functions, and only the one that emitted those metrics was being used. Removing the private/public support code changed which one is used, so now the other one runs instead. It emits different metrics:

  • tecken.symboldownloader_public_exists
  • tecken.symboldownloader_public_exists_cache_hit
  • tecken.symboldownloader_public_exists_cache_miss

I'll fix this. It probably also means I need to make sure that cache busting is using the correct function and works.

willkg merged PR #2772: "bug 1843356: fix check_url_head metrics" in 1907629.

This changes check_url_head to use the old metrics keys that were emitting data in our dashboards.
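As a rough illustration of what an existence check emitting those keys could look like (this is not the code from the PR): the function name, the dict-based cache, and the Counter used as a stand-in for the metrics client are all assumptions, and whether the real keys are counters or timings is not stated here.

```python
from collections import Counter


def check_url_head_sketch(url, head, cache, metrics,
                          prefix="tecken.symboldownloader_exists"):
    """Return True if head(url) reports 200, caching the answer per URL."""
    metrics[prefix] += 1  # one count per existence check
    if url in cache:
        metrics[f"{prefix}_cache_hit"] += 1
        return cache[url]
    metrics[f"{prefix}_cache_miss"] += 1
    cache[url] = head(url) == 200
    return cache[url]


# Usage with a fake HEAD function, so nothing touches the network.
metrics, cache = Counter(), {}


def fake_head(url):
    return 200


check_url_head_sketch("https://example.com/a.sym", fake_head, cache, metrics)
check_url_head_sketch("https://example.com/a.sym", fake_head, cache, metrics)
print(dict(metrics))
```

Keeping the key prefix as a parameter makes it easy to see how switching functions changed the emitted names from symboldownloader_exists to symboldownloader_public_exists.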

This deployed to prod just now in bug #1848164.

I'm seeing metrics for symboldownloader cache usage again.

Last thing to do here is:

  1. write up a DSRE bug to remove the private symbols bucket for stage and prod

I wrote up DSRE-1401 to cover decommissioning the stage and prod private symbols buckets.

We're all done here.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED