Base image changes for Cloud HSM support packages
Categories: Cloud Services :: Operations: Autograph (task)
Tracking: Not tracked
People: Reporter: hwine; Assigned: hwine
References: Blocks 1 open bug
Attachments: 3 files
tl;dr: we should also install the CloudHSM dynamic engine for OpenSSL on our base image.
There are a few manual operations that currently require manually installing the AWS CloudHSM engine. Since AWS only provides "latest" links, if our base image is "old", a number of other steps are needed. (Our current troubleshooting instructions cover this case.)
Adding the package does not affect the security of the image.
Are we using CentOS AMIs on our EC2 instances? If so, are we on CentOS 6 or 7?
Do we still mount crypto libs into the container (using a Debian base IIRC)? Will those still work with the upgrade?
Sorry for the confusion -- I am not proposing updating the OS base image, just the CloudHSM software base installs.
Once the build manifest is updated, we have to generate a "new image", but that's just part of the normal deploy process.
OK thanks!
So I think we're talking about roughly the following changes to the deploy pipeline and base docker image: https://github.com/mozilla-services/autograph/blob/ee2ac447c6129ee485a9b48bba3d356fa286452a/Dockerfile
- Update rpm url in https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/autograph/HSM.md#install-the-client-manually to one for the dynamic engine in the AWS docs:
wget https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-client-dyn-latest.el7.x86_64.rpm
sudo yum install -y ./cloudhsm-client-dyn-latest.el7.x86_64.rpm
and bump the associated package versions on https://github.com/mozilla-services/cloudops-deployment/blob/1c6cda3b728a947aed2627df24d3c7a0470d8268/projects/autograph/puppet/modules/autograph/manifests/hsm.pp#L9
- update the mounted libraries (add /opt/cloudhsm/lib/libcloudhsm_openssl.so?) and confirm none missing with ldd at https://github.com/mozilla-services/cloudops-deployment/blob/1c6cda3b728a947aed2627df24d3c7a0470d8268/projects/autograph/puppet/modules/autograph/manifests/app.pp#L52-L56 (sidenote: I believe we wanted to mount fewer of them at some point)
- update directions on https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/autograph/HSM.md#configure-and-start-cloudhsm-client
- update the HSM client cert in sops?
(In reply to Greg Guthe [:g-k] [:gguthe] from comment #3)
So I think we're talking about roughly the following changes to the deploy pipeline and base docker image: https://github.com/mozilla-services/autograph/blob/ee2ac447c6129ee485a9b48bba3d356fa286452a/Dockerfile
There are no changes to the base docker image. I'm confused about how it is even useful, since it is Debian based while the production container is EL7 based.
- Update rpm url in https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/autograph/HSM.md#install-the-client-manually to one for the dynamic engine in the AWS docs:
This is in addition to the existing rpm URL (not an update of the existing lines).
- update the mounted libraries (add /opt/cloudhsm/lib/libcloudhsm_openssl.so?)
This is not needed -- this package is not needed for the autograph app. It is only needed for manual CLI invocations of openssl after the instance has been removed from the autograph cluster.
- update the HSM client cert in sops?
This should not be affected.
Comment 5•5 years ago
:hwine is there anything for me to do here? Or can we close this out, or re-assign?
(In reply to Bob Micheletto [:bobm] from comment #5)
:hwine is there anything for me to do here? Or can we close this out, or re-assign?
I just took the bug. The next steps, afaik, are:
- [ ] tweak the container build manifest to also install the OpenSSL shim
- [ ] test in staging
- [ ] use in production
:bobm -- sorry, I was wrong -- I don't have the chops to test this.
But I did make a PR
Comment 8•5 years ago
I added the latest versions of the RPMs to the S3 yum repo, and bumped the version in the manifest in the PR. Running a stage deployment to test presently.
Comment 9•5 years ago
Testing revealed that the latest version of these packages causes the application to fail because of changes in an internal API. Re-assigning this bug to :hwine to make this a procedural documentation change.
Comment 10•5 years ago
Thread from debugging stage https://mozilla.slack.com/archives/CEMMGTZJ5/p1596663741013000
Comment 11•5 years ago
(In reply to Greg Guthe [:g-k] [:gguthe] from comment #10)
Thread from debugging stage https://mozilla.slack.com/archives/CEMMGTZJ5/p1596663741013000
Capturing thread before it's purged
Comment 12•5 years ago
Comment 13•5 years ago
When the XPI signers and their creds are commented out, stage runs and the monitor passes (it doesn't check APK2, gpg, or gpg2 signatures), which is consistent with this part of the transcript Hal saved in comment 11:
it's failing at ./signer/xpi/x509.go:33 which is trying to call new(crypto11.PKCS11RandReader)
So we have a few options:
- disable use of the HSM RNG in favor of the golang RNG (currently, if the key is in the HSM, autograph will try to use the HSM RNG for that signer -- edit: that's not right), possibly with XPI EE cert reuse https://github.com/mozilla-services/autograph/issues/237
- generate and clean up XPI EE certs in the HSM (might run into resource limitations; requires giving autograph user perms to create keys in the HSM, complicates HSM management)
- debug and fix the pkcs11 error
I'm going to try option 3 (debugging it), but we can always fall back to option 1 (removing the HSM RNG) if that fails and we're running out of time.
NB: the OpenSSL dynamic engine does not support offloading RNG without opening a support case (https://docs.aws.amazon.com/cloudhsm/latest/userguide/ki-openssl-sdk.html#ki-openssl-2), but that doesn't seem to be the case for PKCS11.
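For context, a minimal sketch (assuming the crypto11 v0.1.x API, where PKCS11RandReader implements io.Reader on top of C_GenerateRandom) of how the HSM rand reader stands in for the software RNG; the helper and flag below are hypothetical, not autograph's actual code, and crypto11 must already be configured for the HSM reader to work:

package main

import (
    "crypto/rand"
    "crypto/rsa"
    "io"
    "log"

    "github.com/ThalesIgnite/crypto11"
)

// rsaKeyFromRNG is a hypothetical helper: it uses the HSM-backed reader when
// useHSM is true, otherwise the Go software RNG (option 1 above).
func rsaKeyFromRNG(useHSM bool) (*rsa.PrivateKey, error) {
    var rng io.Reader = rand.Reader
    if useHSM {
        // Equivalent to the call at signer/xpi/x509.go:33; every Read ends up
        // as a C_GenerateRandom call on the HSM through the pkcs11 context.
        rng = new(crypto11.PKCS11RandReader)
    }
    return rsa.GenerateKey(rng, 2048)
}

func main() {
    // Assumes crypto11.Configure has already been called with the HSM config.
    if _, err := rsaKeyFromRNG(true); err != nil {
        log.Fatal(err) // surfaces pkcs11 errors if the context is unhealthy
    }
}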
To debug the RNG we'll want to check each layer of the call stack:
- https://github.com/ThalesIgnite/crypto11 provides the golang crypto.Signer and other language-standard interfaces
- https://github.com/miekg/pkcs11 is the low-level wrapper of the PKCS11 C API
- https://cryptsoft.com/pkcs11doc/v220/pkcs11__all_8h.html#aC_GenerateRandom documents C_GenerateRandom, the main C call we're interested in
Per https://github.com/ThalesIgnite/crypto11#testing-with-aws-cloudhsm it was only tested against the CloudHSM 2 libraries, so that might be the source of the errors.
We're using a rather old version (v0.1.0) of ThalesIgnite/crypto11, and upgrading to any newer 1.x+ release would introduce breaking changes, including losing access to the miekg/pkcs11 context that we use to initialize the HSM RNG.
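To make that dependency concrete, here's a minimal sketch of the v0-style setup (assuming crypto11 v0.1.x, where Configure returns the underlying miekg/pkcs11 *pkcs11.Ctx; the path, token label, and PIN are placeholders, not our real config), showing the direct C_GenerateRandom access we'd lose with a 1.x upgrade:

package main

import (
    "log"

    "github.com/ThalesIgnite/crypto11"
    "github.com/miekg/pkcs11"
)

func main() {
    // In v0, Configure hands back the low-level pkcs11 context; the 1.x API
    // no longer exposes it.
    p11Ctx, err := crypto11.Configure(&crypto11.PKCS11Config{
        Path:       "/opt/cloudhsm/lib/libcloudhsm_pkcs11.so",
        TokenLabel: "cavium",        // placeholder
        Pin:        "user:password", // placeholder
    })
    if err != nil {
        log.Fatal(err)
    }

    // With the raw context we can drive C_GenerateRandom ourselves, the call
    // at the bottom of the stack listed above.
    slots, err := p11Ctx.GetSlotList(true)
    if err != nil || len(slots) == 0 {
        log.Fatal("no PKCS11 slots available: ", err)
    }
    session, err := p11Ctx.OpenSession(slots[0], pkcs11.CKF_SERIAL_SESSION)
    if err != nil {
        log.Fatal(err)
    }
    defer p11Ctx.CloseSession(session)
    buf, err := p11Ctx.GenerateRandom(session, 32)
    if err != nil {
        log.Fatal(err) // e.g. pkcs11: 0x7: CKR_ARGUMENTS_BAD when sessions are exhausted
    }
    log.Printf("read %d random bytes from the HSM", len(buf))
}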
Running the ThalesIgnite/crypto11@0.1.0 rand test (clone their repo, check out 0.1.0, update the test config file with the stage HSM config) results in this error, which looks more helpful:
go test -run TestRandomReader
2020/10/29 19:51:50 Could decode config file: invalid character 'c' after object key
--- FAIL: TestRandomReader (0.00s)
require.go:794:
Error Trace: rand_test.go:34
Error: Received unexpected error:
invalid character 'c' after object key
Test: TestRandomReader
FAIL
exit status 1
FAIL github.com/ThalesIgnite/crypto11 0.004s
nm, that error is from crypto11 trying to configure itself.
- Using pkcs11.Configure like autograph does causes the test to pass
- Tests still pass with new(PKCS11RandReader) and multiple calls to it
So it's probably something autograph-specific, and debugging the calls it makes with the failing config might be more useful.
The cache keygen errors from the goroutines seem like a problem.
With one XPI signer and one XPI RSA key generator the app started most of the time, but increasing the generators to 30 (increasing the number of XPI signers would probably cause similar problems) caused it to fail with CKR_ARGUMENTS_BAD every time I tried. I also saw an error from fetching a key:
{"Timestamp":1604005730281323793,"Time":"2020-10-29T21:08:50Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":1,"Severity":2,"Fields":{"msg":"failed to add signer \"normandy\": contentsignaturepki \"normandy\": failed to get keys: pkcs11: 0x7: CKR_ARGUMENTS_BAD"}}
so I think this is related to concurrent use of the pkcs11 context and not anything particular to the RNG (we see that call fail the most since it's the most frequent call on boot).
This might explain the HSM disconnects and intermittent 502s in bug 1563796 (and other bugs) we saw earlier (the app fails to start on boot when the ASG predictively scales up), and why :bobm consolidating onto fewer, larger permanent nodes fixed it.
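A hypothetical sketch of that failure mode (the generator count and structure are made up for illustration, and crypto11 is assumed to be configured already): many goroutines generating RSA keys through the single shared PKCS11 context, with the client library opening more and more sessions until calls start failing with CKR_ARGUMENTS_BAD / "Already maximum number of sessions are issued":

package main

import (
    "crypto/rsa"
    "log"
    "sync"

    "github.com/ThalesIgnite/crypto11"
)

func main() {
    const generators = 30 // roughly the XPI RSA key cache generator count tested above
    var wg sync.WaitGroup
    for i := 0; i < generators; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            // Each generator pulls its randomness from the HSM through the
            // shared pkcs11 context; under enough concurrency the session
            // pool blows past the HSM's limit and reads start erroring.
            if _, err := rsa.GenerateKey(new(crypto11.PKCS11RandReader), 2048); err != nil {
                log.Printf("generator %d: %v", n, err)
            }
        }(i)
    }
    wg.Wait()
}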
This could also be:
Very rapid session opening can trigger the following error:
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
HSM error 8c: HSM Error: Already maximum number of sessions are issued
from https://github.com/ThalesIgnite/crypto11#testing-with-aws-cloudhsm but we don't have the full error message. If that were the case, we'd expect almost all failures past a certain threshold.
This could actually be the issue: running in a container with 30 generators and without redirecting to stdout, I see:
$ autograph -c config.yaml
...
{"Timestamp":1604419874512978880,"Time":"2020-11-03T16:11:14Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":463,"Severity":6,"Fields":{"msg":"contentsignaturepki \"onecrl\": reusing existing EE \"onecrl-20201026205928\""}}
HSM error 8c: HSM Error: Already maximum number of sessions are issued
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
HSM error 8c: HSM Error: Already maximum number of sessions are issued
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
{"Timestamp":1604419875396468817,"Time":"2020-11-03T16:11:15Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":463,"Severity":2,"Fields":{"msg":"xpi: error generating RSA key for cache: \"pkcs11: 0x7: CKR_ARGUMENTS_BAD\""}}
CloudHSM doesn't have a documented PKCS11 session limit (or I couldn't find one), but https://docs.aws.amazon.com/cloudhsm/latest/userguide/limits.html does have "Maximum number of concurrent clients 900", and crypto11 defaults to 1024 sessions.
Indeed, setting maxsessions: 64 in the HSM part of the autograph config allowed autograph to start even after bumping the number of generators up to 255 (the limit imposed by the uint8 config type).
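A sketch of what that amounts to at the crypto11 layer, assuming the hsm section of the autograph config is passed through to crypto11's PKCS11Config and that our vendored copy exposes a MaxSessions knob (the field name and the other values here are my assumptions, not the real config):

package main

import (
    "log"

    "github.com/ThalesIgnite/crypto11"
)

func main() {
    // Cap the session pool well under CloudHSM's ~900 concurrent-client limit
    // instead of crypto11's default of 1024 sessions.
    if _, err := crypto11.Configure(&crypto11.PKCS11Config{
        Path:        "/opt/cloudhsm/lib/libcloudhsm_pkcs11.so",
        TokenLabel:  "cavium",        // placeholder
        Pin:         "user:password", // placeholder
        MaxSessions: 64,              // mirrors maxsessions: 64 in the autograph config
    }); err != nil {
        log.Fatal(err)
    }
}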
Comment 14•5 years ago
also tried looking at the HSM audit logs
Comment 15•5 years ago
So the fix is to add a more reasonable maxsessions to the autograph stage and prod app configs.
Additional links from debugging:
- https://github.com/ThalesIgnite/crypto11/issues/36
- https://github.com/ThalesIgnite/crypto11/releases/tag/v0.1.0
- https://github.com/ThalesIgnite/crypto11/pull/59
- https://godoc.org/gopkg.in/ThalesIgnite/crypto11.v0 (v1.x at https://pkg.go.dev/github.com/ThalesIgnite/crypto11#hdr-Sessions_and_concurrency)
- https://python-pkcs11.readthedocs.io/en/latest/applied.html#concepts-in-pkcs-11
Going forward it would be nice to:
- upgrade to a newer version of crypto11
- capture the additional logs from https://bugzilla.mozilla.org/show_bug.cgi?id=1641103#c13 -- however, those logs are coming from the cloudhsm library itself, so that doesn't look easy:
[gguthe@ip-172-31-22-191 ~]$ strings /opt/cloudhsm/lib/libcloudhsm_pkcs11.so | grep 'failed with error'
%s failed with error %s : 0x%08lx
Delete partition failed with error code [%d] !!
Resize partition failed with error code [%d] !!
Create partition failed with error code [%d]!!
Validation of template is failed with error 0x%x
[gguthe@ip-172-31-22-191 ~]$ strings /opt/cloudhsm/lib/libcloudhsm_pkcs11.so | grep 'HSM error'
HSM error %lx: %s
There are some outdated headers on https://github.com/aws-samples/aws-cloudhsm-pkcs11-examples/tree/master/include/pkcs11/v2.40, but we'd probably have to go to Cavium to find the source (if it's publicly available at all).
Comment 16•5 years ago
Update: the PR was closed without merging.
All work related to the HSM client libraries breaking autograph is really covered in bug 1672287. Once that is fixed, the PR can be recreated.
Comment 17•5 years ago
PR has been recreated.
Comment 18•5 years ago
Everything is landed and updated AFAIK.