Closed Bug 1536130 Opened 6 years ago Closed 6 years ago

Autograph returning 401 and 502 errors for the Lockbox Android App

Categories

(Cloud Services :: Operations: Autograph, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dreams, Unassigned)

References

(Blocks 1 open bug)

Details

Fairly recently the stage and prod endpoints have started returning 401 and 502 error codes. The 502s appear seemingly more frequently when running during bitrise builds for signing for the Lockbox Android app:

https://github.com/mozilla-lockbox/lockbox-android/issues/501

Hey Devin, this is expected.

I didn't finish the autograph deploy to prod for bug 1533181 last week.

Planning to get that out this Wednesday.

Depends on: 1533181

Ah, OK thank you for the quick response, Greg. :)

Ah I misread. I thought this was about the Send APK which is on the HSM stack (Lockbox APK is on the legacy stack).

Tested and this is working at the moment:

$ rm test-signed.apk; curl -F "input=@test.apk" -o test-signed.apk -H "Authorization: $LOCKBOX_PROD_TOKEN" https://autograph-edge.prod.mozaws.net/sign

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14389  100  8563  100  5826  14662   9976 --:--:-- --:--:-- --:--:-- 24596
$ apksigner verify --print-certs --verbose test-signed.apk
Verifies
Verified using v1 scheme (JAR signing): true
Verified using v2 scheme (APK Signature Scheme v2): false
Number of signers: 1
Signer #1 certificate DN: CN=Release Engineering, OU=Release Engineering, O=Mozilla Corporation, L=Mountain View, ST=California, C=US
Signer #1 certificate SHA-256 digest: 64d26b507078deba2fee42d6bd0bfad41d39ffc4e791f281028e5e73d3c8d2f2
Signer #1 certificate SHA-1 digest: 778a92ea4ba1c6fe1b883108b8b9f832d6fd8ae2
Signer #1 certificate MD5 digest: e51c31dd0a2652cc591be6f690aa6f02
Signer #1 key algorithm: RSA
Signer #1 key size (bits): 2048
Signer #1 public key SHA-256 digest: 42b03e9c46258be9c3d369bd6a8ed25376723f6ed884cf93cdd2dc3011162251
Signer #1 public key SHA-1 digest: 6b3d026ee14d5f01c1116d59603020370eddf0e8
Signer #1 public key MD5 digest: 22d2b18899aa4f3e1ca3bf86ba8a2164

I'll check the logs to see what happened around the failures mentioned in the github issue.

Blocks: autograph
No longer depends on: 1533181
Summary: Autograph returning 401 and 502 errors → Autograph returning 401 and 502 errors for the Lockbox Android App

These errors are consistent with a failed deploy or old box in an autoscaling group, since we'd get 401s for invalid creds in autograph edge and 502s for invalid upstream creds in the autograph app (no signer or missing signer).

Yeah hmm...

I get sporadic successes and just saw one right now succeed against prod:
https://app.bitrise.io/build/a2252e8546207c6e

And one fail right after (it started at the same time/commit):
https://app.bitrise.io/build/8af613a1ad569b4b

(In reply to Devin Reams (dreams) from comment #5)

Yeah hmm...

I get sporadic successes and just saw one right now succeed against prod:
https://app.bitrise.io/build/a2252e8546207c6e

And one fail right after (it started at the same time/commit):
https://app.bitrise.io/build/8af613a1ad569b4b

Weird. It might correlate with XPI signing load from AMO. This is also consistent with having some traffic going to an old stack from the ELB.

We also need to migrate lockbox keys to the hsm stack at some point.

I tried repeating a request with just handful of clients running watch -n1 curl ... to see if I could produce errors, but didn't, so I don't think this is an app logic problem.

I also don't see any non-201 requests in the autograph app nginx access logs for the 20th:

[gguthe@ip-172-31-31-4 raw]$ grep 'POST /sign/file' /var/log/raw/autograph.autograph.nginx.access.log | grep -v 'HTTP/1.1" 201'
[gguthe@ip-172-31-31-4 raw]$

which suggests a problem with the autograph edge to app connection or the bitrise to autograph edge connection.

:miles do you know where the non-hsm stack edge logs live? I didn't see them on the grep host.

Given that curl says * HTTP error before end of send, stop sending for all the failing builds (referenced here and in the github issue), I'm guessing a connectivity issue.

If autograph edge proxies a partial input it'll return a 502 (since the app will return 4xx or 5xx status) i.e.

$ head -c 253 test.apk > test-truncated.apk
$ curl -v -F "input=@test-truncated.apk" -H "Authorization: $LOCKBOX_PROD_TOKEN" https://autograph-edge.p
rod.mozaws.net/sign       
*   Trying 52.41.84.60...                                    
* TCP_NODELAY set                                                       
* Connected to autograph-edge.prod.mozaws.net (52.41.84.60) port 443 (#0)
...
< HTTP/1.1 502 Bad Gateway
< Cache-Control: private, no-cache, no-store, must-revalidate
< Content-Security-Policy: default-src 'none'; report-uri /__cspreport__
< Content-Type: text/plain; charset=utf-8
< Date: Wed, 20 Mar 2019 21:11:02 GMT
< X-Content-Type-Options: nosniff
< X-Frame-Options: DENY
< X-XSS-Protection: 1; mode=block
< Content-Length: 39
< Connection: keep-alive
* HTTP error before end of send, stop sending
< 
failed to call autograph for signature
* Closing connection 0
* TLSv1.2 (OUT), TLS alert, Client hello (1):
$ 

I think a connection time out would also give us a 502 and we'd get a 401 if the content hash is wrong causing a HAWK header mismatch.

:dreams can you try the curl --retry option as a workaround (maybe --retry-max-time to limit the overall time too)?

--retry <num>

If a transient error is returned when curl tries to perform a transfer, it will retry this number of times before giving up. Setting the number to 0 makes curl do no retries (which is the default). Transient error means either: a timeout, an FTP 4xx response code or an HTTP 408 or 5xx response code.

When curl is about to retry a transfer, it will first wait one second and then for all forthcoming retries it will double the waiting time until it reaches 10 minutes which then will be the delay between the rest of the retries. By using --retry-delay you disable this exponential backoff algorithm. See also --retry-max-time to limit the total time allowed for retries.

If this option is used several times, the last one will be used.

Added in 7.12.3.

https://curl.haxx.se/docs/manpage.html

If that works we should update the autograph edge and mobile docs. I opened https://github.com/mozilla-services/autograph-edge/issues/15 and will bump up the autograph app connection timeouts for our next deploy (they're currently 1 minute) too.

Flags: needinfo?(miles)

Miles raised the possibility that this is related to request size since the APKs are around ~50MB and this has been a problem before.

I bumped the hawk timestamp validity to 10 minutes from the 1 minute default (acceptable skew from when the HAWK header is created to the being read off the network and validated) in the autograph app config (hiera-sops commit 01637eb85aac6a5e4904827cafe6ec55e725d644) and tested that autograph starts against the updated configs.

Flags: needinfo?(miles)

:dreams the hawk ts validity change is deployed to stage and prod, so I'm tentatively resolving this as fixed. If you still see errors please reopen this since other APKs are probably affected too.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Hey Greg, the size/timeout is an interesting theory. The overall app size has gone from ~40 MB over the past few months up just a bit to 47-50.

We tried adding retries and see those still fail as incomplete with a 502 thrown:
https://app.bitrise.io/build/22d478b674fbb083

 91 47.1M    0     0   91 43.0M      0  10.9M  0:00:04  0:00:03  0:00:01 10.9M< HTTP/1.1 502 Bad Gateway
< Cache-Control: private, no-cache, no-store, must-revalidate
< Content-Security-Policy: default-src 'none'; report-uri /__cspreport__
< Content-Type: text/html
< Date: Thu, 21 Mar 2019 14:48:11 GMT
< X-Content-Type-Options: nosniff
< X-Frame-Options: DENY
< X-XSS-Protection: 1; mode=block
< Content-Length: 166
< Connection: keep-alive
* HTTP error before end of send, stop sending
< 
{ [166 bytes data]

100 47.1M  100   166  100 47.1M     37  10.5M  0:00:04  0:00:04 --:--:-- 10.5M
* Closing connection 0
Warning: Transient problem: HTTP error Will retry in 1 seconds. 3 retries 
Warning: left.
Throwing away 166 bytes
* Hostname autograph-edge.prod.mozaws.net was found in DNS cache
*   Trying 52.41.84.60...

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to autograph-edge.prod.mozaws.net (52.41.84.60) port 443 (#1)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL re-using session ID
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: *.prod.mozaws.net (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: C=US,ST=California,L=Mountain View,O=Mozilla Corporation,OU=Cloud Services,CN=*.prod.mozaws.net
* 	 start date: Tue, 06 Feb 2018 00:00:00 GMT
* 	 expire date: Wed, 10 Feb 2021 12:00:00 GMT
* 	 issuer: C=US,O=DigiCert Inc,CN=DigiCert SHA2 Secure Server CA
* 	 compression: NULL
* ALPN, server did not agree to a protocol
> POST /sign HTTP/1.1
> Host: autograph-edge.prod.mozaws.net
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 49415780
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------d96a3deed35ef12c
> 
< HTTP/1.1 100 Continue
} [158 bytes data]

  0 47.1M    0     0    0  432k      0   865k  0:00:55 --:--:--  0:00:55  864k
 44 47.1M    0     0   44 20.9M      0  14.1M  0:00:03  0:00:01  0:00:02 14.1M
 99 47.1M    0     0   99 47.0M      0  18.9M  0:00:02  0:00:02 --:--:-- 18.9M< HTTP/1.1 502 Bad Gateway
< Cache-Control: private, no-cache, no-store, must-revalidate
< Content-Security-Policy: default-src 'none'; report-uri /__cspreport__
< Content-Type: text/html
< Date: Thu, 21 Mar 2019 14:48:15 GMT
< X-Content-Type-Options: nosniff
< X-Frame-Options: DENY
< X-XSS-Protection: 1; mode=block
< Content-Length: 166
< Connection: keep-alive
* HTTP error before end of send, stop sending
< 
{ [166 bytes data]

100 47.1M  100   166  100 47.1M     64  18.2M  0:00:02  0:00:02 --:--:-- 18.2M
* Closing connection 1
Warning: Transient problem: HTTP error Will retry in 2 seconds. 2 retries 
Warning: left.
Throwing away 166 bytes
* Hostname autograph-edge.prod.mozaws.net was found in DNS cache
*   Trying 52.41.84.60...

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to autograph-edge.prod.mozaws.net (52.41.84.60) port 443 (#2)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL re-using session ID
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: *.prod.mozaws.net (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: C=US,ST=California,L=Mountain View,O=Mozilla Corporation,OU=Cloud Services,CN=*.prod.mozaws.net
* 	 start date: Tue, 06 Feb 2018 00:00:00 GMT
* 	 expire date: Wed, 10 Feb 2021 12:00:00 GMT
* 	 issuer: C=US,O=DigiCert Inc,CN=DigiCert SHA2 Secure Server CA
* 	 compression: NULL
* ALPN, server did not agree to a protocol
> POST /sign HTTP/1.1
> Host: autograph-edge.prod.mozaws.net
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 49415780
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------503bf9944b05a48f
> 
< HTTP/1.1 100 Continue
} [158 bytes data]

 16 47.1M    0     0   16 7952k      0  8414k  0:00:05 --:--:--  0:00:05 8414k
 71 47.1M    0     0   71 33.7M      0  17.8M  0:00:02  0:00:01  0:00:01 17.8M< HTTP/1.1 502 Bad Gateway
< Cache-Control: private, no-cache, no-store, must-revalidate
< Content-Security-Policy: default-src 'none'; report-uri /__cspreport__
< Content-Type: text/html
< Date: Thu, 21 Mar 2019 14:48:19 GMT
< X-Content-Type-Options: nosniff
< X-Frame-Options: DENY
< X-XSS-Protection: 1; mode=block
< Content-Length: 166
< Connection: keep-alive
* HTTP error before end of send, stop sending
< 
{ [166 bytes data]

100 47.1M  100   166  100 47.1M     67  19.1M  0:00:02  0:00:02 --:--:-- 19.1M
* Closing connection 2
Warning: Transient problem: HTTP error Will retry in 4 seconds. 1 retries 
Warning: left.
Throwing away 166 bytes
* Hostname autograph-edge.prod.mozaws.net was found in DNS cache
*   Trying 52.41.84.60...

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to autograph-edge.prod.mozaws.net (52.41.84.60) port 443 (#3)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL re-using session ID
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: *.prod.mozaws.net (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: C=US,ST=California,L=Mountain View,O=Mozilla Corporation,OU=Cloud Services,CN=*.prod.mozaws.net
* 	 start date: Tue, 06 Feb 2018 00:00:00 GMT
* 	 expire date: Wed, 10 Feb 2021 12:00:00 GMT
* 	 issuer: C=US,O=DigiCert Inc,CN=DigiCert SHA2 Secure Server CA
* 	 compression: NULL
* ALPN, server did not agree to a protocol
> POST /sign HTTP/1.1
> Host: autograph-edge.prod.mozaws.net
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 49415780
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------eb2852e719a6943b
> 
< HTTP/1.1 100 Continue
} [158 bytes data]

  0 47.1M    0     0    0  224k      0   504k  0:01:35 --:--:--  0:01:35  503k
 36 47.1M    0     0   36 17.0M      0  12.0M  0:00:03  0:00:01  0:00:02 12.0M
 74 47.1M    0     0   74 34.9M      0  14.5M  0:00:03  0:00:02  0:00:01 14.5M< HTTP/1.1 502 Bad Gateway
< Cache-Control: private, no-cache, no-store, must-revalidate
< Content-Security-Policy: default-src 'none'; report-uri /__cspreport__
< Content-Type: text/html
< Date: Thu, 21 Mar 2019 14:48:27 GMT
< X-Content-Type-Options: nosniff
< X-Frame-Options: DENY
< X-XSS-Protection: 1; mode=block
< Content-Length: 166
< Connection: keep-alive
* HTTP error before end of send, stop sending
< 
{ [166 bytes data]

100 47.1M  100   166  100 47.1M     51  14.4M  0:00:03  0:00:03 --:--:-- 14.4M
* Closing connection 3
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to Devin Reams (dreams) from comment #10)

Hey Greg, the size/timeout is an interesting theory. The overall app size has gone from ~40 MB over the past few months up just a bit to 47-50.

That's interesting. Are these first failures?

3/14 09:19 (401) https://app.bitrise.io/build/963a93749efe072c
3/13 15:46 (502) https://app.bitrise.io/build/279a156a264f6934

Has the error rate gone up gradually or just when we hit ~47MB?

I'll check to see what size the other APKs we sign with edge are.

The 401s could be coming from a hawk TS timeout from autograph-edge. Edge also naively buffers the whole file, so streaming the request body might be preferable: https://github.com/mozilla-services/autograph-edge/blob/2310a9b70edf46b625dc8687c2a7e07085cc38c1/main.go#L132

The APK for Firefox Send is 97MB and goes through edge fine. They don't use bitrise for signing though (or any CI).

:dreams are you open to switching the Lockbox apk to AAB signing? If we're only distributing on the Play store, this takes autograph out of the signing path. (we tested this out with the Rocket / Fx Lite team in https://bugzilla.mozilla.org/show_bug.cgi?id=1473845)

Other benefits:

  • v2 APK signatures that prevent malware injection
  • google handles the signing key so we can rotate it
  • per system optimized builds (so smaller initial APKs to install)
  • can split out features and libs to download separately (e.g. langpacks)

Limitations / potential downsides:

  • only applies to distribution through the Play store
  • google handles the signing key, so they could sign things

Also, it's important that bitrise only has access to a Play store API key with permission to upload new releases, but not enable distributing them (since we don't want a bitrise compromise to result in distributing malware to users).

There's a good walkthrough at: https://medium.com/mindorks/securing-and-optimizing-your-app-with-google-play-app-signing-24a3658fd319

If you want to try AAB, I'll need the pepk.jar and google public key ID you want me to sign our private key with.

Flags: needinfo?(dreams)

I'd also recommend testing the signing locally and working with bitrise if the problem comes from them.

I'd also recommend testing the signing locally and working with bitrise if the problem comes from them.

Yeah good suggestion, I remember using unsigned files locally with mixed success (i.e.: I have seen at least one 502 locally).

That said, no worries, its successful enough for us to be more graceful and not hard fail on our side. We'll aim to work around this for a little bit until...

:dreams are you open to switching the Lockbox apk to AAB signing?

Yes! Definitely wanted to get us moving this direction (since your email in January "standardizing AAB signing for Lockbox").

I will pick back up there after our launch and see how we can re-tool things for a better and more stable solution.

Do you want to close this issue for now then?

Flags: needinfo?(dreams)

(In reply to Devin Reams (dreams) from comment #15)

Do you want to close this issue for now then?

yeah, let's revisit after launch.

I added the keys for the lockbox addon to the config for the HSM stack in autograph-hiera-sops commit 732b595a3776b0d08fd997e8664985d38c8d3469 The HSM stack will probably produce the same errors, but it's easier for me to inspect the logs there at least.

Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Depends on: 1538203
Resolution: --- → FIXED

Following up, we redeployed the Lockbox Android key to the HSM stack. It uses the same creds from https://edge.prod.autograph.services.mozaws.net/sign and should be more stable:

» curl -F "input=@test.apk" -o test-signed.apk -H "Authorization: $LOCKBOX_EDGE" https://edge.prod.autograph.services.mozaws.net/sign
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14390    0  8564  100  5826  21736  14786 --:--:-- --:--:-- --:--:-- 36430
» apksigner verify --print-certs --verbose test-signed.apk
Verifies
Verified using v1 scheme (JAR signing): true
Verified using v2 scheme (APK Signature Scheme v2): false
Number of signers: 1
Signer #1 certificate DN: CN=Release Engineering, OU=Release Engineering, O=Mozilla Corporation, L=Mountain View, ST=California, C=US
Signer #1 certificate SHA-256 digest: 64d26b507078deba2fee42d6bd0bfad41d39ffc4e791f281028e5e73d3c8d2f2
Signer #1 certificate SHA-1 digest: 778a92ea4ba1c6fe1b883108b8b9f832d6fd8ae2
Signer #1 certificate MD5 digest: e51c31dd0a2652cc591be6f690aa6f02
Signer #1 key algorithm: RSA
Signer #1 key size (bits): 2048
Signer #1 public key SHA-256 digest: 42b03e9c46258be9c3d369bd6a8ed25376723f6ed884cf93cdd2dc3011162251
Signer #1 public key SHA-1 digest: 6b3d026ee14d5f01c1116d59603020370eddf0e8
Signer #1 public key MD5 digest: 22d2b18899aa4f3e1ca3bf86ba8a2164
You need to log in before you can comment on or make changes to this bug.