Base image changes for Cloud HSM support packages
Categories: Cloud Services :: Operations: Autograph (task)
Tracking: Not tracked
People: Reporter: hwine; Assigned: hwine
References: Blocks 1 open bug
Attachments: 3 files
tl;dr: we should also install the CloudHSM dynamic engine for OpenSSL on our base image.
There are a few manual operations that currently require manually installing the AWS CloudHSM engine. Since AWS only provides "latest" links, if our base image is "old", a number of other steps are needed. (Our current troubleshooting instructions cover this case.)
Adding the package does not affect the security of the image.
Are we using CentOS AMIs on our EC2 instances? If so, are we on CentOS 6 or 7?
Do we still mount crypto libs into the container (using a Debian base IIRC)? Will those still work with the upgrade?
Sorry for the confusion -- I am not proposing updating the OS base image, just the CloudHSM software base installs.
Once the build manifest is updated, we have to generate a "new image", but that's just part of the normal deploy process.
OK thanks!
So I think we're talking about roughly the following changes to the deploy pipeline and base docker image: https://github.com/mozilla-services/autograph/blob/ee2ac447c6129ee485a9b48bba3d356fa286452a/Dockerfile
- Update rpm url in https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/autograph/HSM.md#install-the-client-manually to one for the dynamic engine in the AWS docs:
wget https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-client-dyn-latest.el7.x86_64.rpm
sudo yum install -y ./cloudhsm-client-dyn-latest.el7.x86_64.rpm
and bump the associated package versions on https://github.com/mozilla-services/cloudops-deployment/blob/1c6cda3b728a947aed2627df24d3c7a0470d8268/projects/autograph/puppet/modules/autograph/manifests/hsm.pp#L9
- update the mounted libraries (add /opt/cloudhsm/lib/libcloudhsm_openssl.so?) and confirm none missing with ldd at https://github.com/mozilla-services/cloudops-deployment/blob/1c6cda3b728a947aed2627df24d3c7a0470d8268/projects/autograph/puppet/modules/autograph/manifests/app.pp#L52-L56 (sidenote: I believe we wanted to mount fewer of them at some point)
- update directions on https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/autograph/HSM.md#configure-and-start-cloudhsm-client
- update the HSM client cert in sops?
(In reply to Greg Guthe [:g-k] [:gguthe] from comment #3)
So I think we're talking about roughly the following changes to the deploy pipeline and base docker image: https://github.com/mozilla-services/autograph/blob/ee2ac447c6129ee485a9b48bba3d356fa286452a/Dockerfile
There are no changes to the base docker image. I'm confused about how it is even useful, since it is Debian based while the production container is EL7 based.
- Update rpm url in https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/autograph/HSM.md#install-the-client-manually to one for the dynamic engine in the AWS docs:
This is in addition to the existing rpm URL (not an update of the existing lines).
- update the mounted libraries (add /opt/cloudhsm/lib/libcloudhsm_openssl.so?)
This is not needed -- this package is not needed for the autograph app. It is only needed for manual CLI invocations of openssl after the instance has been removed from the autograph cluster.
- update the HSM client cert in sops?
This should not be affected.
Comment 5•5 years ago
:hwine is there anything for me to do here? Or can we close this out, or re-assign?
(In reply to Bob Micheletto [:bobm] from comment #5)
:hwine is there anything for me to do here? Or can we close this out, or re-assign?
I just took the bug. The next steps, afaik, are:
- [ ] tweak the container build manifest to also install the OpenSSL shim
- [ ] test in staging
- [ ] use in production
:bobm -- sorry, I was wrong -- I don't have the chops to test this.
But I did make a PR
Comment 8•5 years ago
I added the latest versions of the RPMs to the S3 yum repo, and bumped the version in the manifest in the PR. Running a stage deployment to test presently.
Comment 9•5 years ago
Testing revealed that the latest version of these packages causes the application to fail because of changes in an internal API. Re-assigning this bug to :hwine to make this a procedural documentation change.
Comment 10•5 years ago
Thread from debugging stage https://mozilla.slack.com/archives/CEMMGTZJ5/p1596663741013000
Comment 11•5 years ago
(In reply to Greg Guthe [:g-k] [:gguthe] from comment #10)
Thread from debugging stage https://mozilla.slack.com/archives/CEMMGTZJ5/p1596663741013000
Capturing thread before it's purged
Comment 12•5 years ago
Comment 13•5 years ago
When the XPI signers and their creds are commented out, stage runs and the monitor passes (it doesn't check APK2, gpg, or gpg2 signatures), which is consistent with this part of the transcript Hal saved in comment 11:
it's failing at ./signer/xpi/x509.go:33 which is trying to call new(crypto11.PKCS11RandReader)
So we have a few options:
- disable use of the HSM RNG in favor of the golang RNG (currently, if the key is in the HSM, autograph will try to use the HSM RNG for that signer -- edit: that's not right), possibly with XPI EE cert reuse https://github.com/mozilla-services/autograph/issues/237
- generate and clean up XPI EE certs in the HSM (might run into resource limitations; requires giving autograph user perms to create keys in the HSM, complicates HSM management)
- debug and fix the pkcs11 error
I'm going to try option 3 (debugging it), but we can always fall back to option 1 (removing the HSM RNG) if that fails and we're running out of time.
NB: the OpenSSL dynamic engine does not support offloading RNG without opening a support case (https://docs.aws.amazon.com/cloudhsm/latest/userguide/ki-openssl-sdk.html#ki-openssl-2), but that doesn't seem to be the case for PKCS11.
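For context, a minimal sketch (assuming the crypto11 v0.1.x API, where PKCS11RandReader implements io.Reader on top of C_GenerateRandom) of how the HSM rand reader stands in for the software RNG; the helper and flag below are hypothetical, not autograph's actual code, and crypto11 must already be configured for the HSM reader to work:

package main

import (
    "crypto/rand"
    "crypto/rsa"
    "io"
    "log"

    "github.com/ThalesIgnite/crypto11"
)

// rsaKeyFromRNG is a hypothetical helper: it uses the HSM-backed reader when
// useHSM is true, otherwise the Go software RNG (option 1 above).
func rsaKeyFromRNG(useHSM bool) (*rsa.PrivateKey, error) {
    var rng io.Reader = rand.Reader
    if useHSM {
        // Equivalent to the call at signer/xpi/x509.go:33; every Read ends up
        // as a C_GenerateRandom call on the HSM through the pkcs11 context.
        rng = new(crypto11.PKCS11RandReader)
    }
    return rsa.GenerateKey(rng, 2048)
}

func main() {
    // Assumes crypto11.Configure has already been called with the HSM config.
    if _, err := rsaKeyFromRNG(true); err != nil {
        log.Fatal(err) // surfaces pkcs11 errors if the context is unhealthy
    }
}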
To debug the RNG we'll want to check each layer of the call stack:
- https://github.com/ThalesIgnite/crypto11 provides the golang crypto.Signer and other language-standard interfaces
- https://github.com/miekg/pkcs11 is the low-level wrapper of the PKCS11 C API
- https://cryptsoft.com/pkcs11doc/v220/pkcs11__all_8h.html#aC_GenerateRandom documents C_GenerateRandom, the main C call we're interested in
Per https://github.com/ThalesIgnite/crypto11#testing-with-aws-cloudhsm it was only tested against the CloudHSM 2 libraries, so that might be the source of the errors.
We're using a rather old version (v0.1.0) of ThalesIgnite/crypto11, and upgrading to any newer 1.x+ release would introduce breaking changes, including losing access to the miekg/pkcs11 context that we use to initialize the HSM RNG.
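To make that dependency concrete, here's a minimal sketch of the v0-style setup (assuming crypto11 v0.1.x, where Configure returns the underlying miekg/pkcs11 *pkcs11.Ctx; the path, token label, and PIN are placeholders, not our real config), showing the direct C_GenerateRandom access we'd lose with a 1.x upgrade:

package main

import (
    "log"

    "github.com/ThalesIgnite/crypto11"
    "github.com/miekg/pkcs11"
)

func main() {
    // In v0, Configure hands back the low-level pkcs11 context; the 1.x API
    // no longer exposes it.
    p11Ctx, err := crypto11.Configure(&crypto11.PKCS11Config{
        Path:       "/opt/cloudhsm/lib/libcloudhsm_pkcs11.so",
        TokenLabel: "cavium",        // placeholder
        Pin:        "user:password", // placeholder
    })
    if err != nil {
        log.Fatal(err)
    }

    // With the raw context we can drive C_GenerateRandom ourselves, the call
    // at the bottom of the stack listed above.
    slots, err := p11Ctx.GetSlotList(true)
    if err != nil || len(slots) == 0 {
        log.Fatal("no PKCS11 slots available: ", err)
    }
    session, err := p11Ctx.OpenSession(slots[0], pkcs11.CKF_SERIAL_SESSION)
    if err != nil {
        log.Fatal(err)
    }
    defer p11Ctx.CloseSession(session)
    buf, err := p11Ctx.GenerateRandom(session, 32)
    if err != nil {
        log.Fatal(err) // e.g. pkcs11: 0x7: CKR_ARGUMENTS_BAD when sessions are exhausted
    }
    log.Printf("read %d random bytes from the HSM", len(buf))
}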
Running the ThalesIgnite/crypto11@0.1.0 rand test (clone their repo, check out 0.1.0, update the test config file with the stage HSM config) results in this error, which looks more helpful:
go test -run TestRandomReader
2020/10/29 19:51:50 Could decode config file: invalid character 'c' after object key
--- FAIL: TestRandomReader (0.00s)
require.go:794:
Error Trace: rand_test.go:34
Error: Received unexpected error:
invalid character 'c' after object key
Test: TestRandomReader
FAIL
exit status 1
FAIL github.com/ThalesIgnite/crypto11 0.004s
nm, that error is from crypto11 trying to configure itself.
- Using pkcs11.Configure like autograph does causes the test to pass
- Tests still pass with new(PKCS11RandReader) and multiple calls to it
So it's probably something autograph-specific, and debugging the calls it makes with the failing config might be more useful.
The cache keygen errors from the goroutines seem like a problem.
With one XPI signer and one XPI RSA key generator the app started most of the time, but increasing the generators to 30 (increasing the number of XPI signers would probably cause similar problems) caused it to fail with CKR_ARGUMENTS_BAD every time I tried. I also saw an error from fetching a key:
{"Timestamp":1604005730281323793,"Time":"2020-10-29T21:08:50Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":1,"Severity":2,"Fields":{"msg":"failed to add signer \"normandy\": contentsignaturepki \"normandy\": failed to get keys: pkcs11: 0x7: CKR_ARGUMENTS_BAD"}}
so I think this is related to concurrent use of the pkcs11 context and not anything particular to the RNG (we see that call fail the most since it's the most frequent call on boot).
This might explain the HSM disconnects and intermittent 502s in bug 1563796 (and other bugs) we saw earlier (the app fails to start on boot when the ASG predictively scales up), and why :bobm consolidating onto fewer, larger permanent nodes fixed it.
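A hypothetical sketch of that failure mode (the generator count and structure are made up for illustration, and crypto11 is assumed to be configured already): many goroutines generating RSA keys through the single shared PKCS11 context, with the client library opening more and more sessions until calls start failing with CKR_ARGUMENTS_BAD / "Already maximum number of sessions are issued":

package main

import (
    "crypto/rsa"
    "log"
    "sync"

    "github.com/ThalesIgnite/crypto11"
)

func main() {
    const generators = 30 // roughly the XPI RSA key cache generator count tested above
    var wg sync.WaitGroup
    for i := 0; i < generators; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            // Each generator pulls its randomness from the HSM through the
            // shared pkcs11 context; under enough concurrency the session
            // pool blows past the HSM's limit and reads start erroring.
            if _, err := rsa.GenerateKey(new(crypto11.PKCS11RandReader), 2048); err != nil {
                log.Printf("generator %d: %v", n, err)
            }
        }(i)
    }
    wg.Wait()
}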
This could also be:
Very rapid session opening can trigger the following error:
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
HSM error 8c: HSM Error: Already maximum number of sessions are issued
from https://github.com/ThalesIgnite/crypto11#testing-with-aws-cloudhsm but we don't have the full error message. If that were the case, we'd expect almost all failures past a certain threshold.
This could actually be the issue: running in a container with 30 generators and without redirecting to stdout, I see:
$ autograph -c config.yaml
...
{"Timestamp":1604419874512978880,"Time":"2020-11-03T16:11:14Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":463,"Severity":6,"Fields":{"msg":"contentsignaturepki \"onecrl\": reusing existing EE \"onecrl-20201026205928\""}}
HSM error 8c: HSM Error: Already maximum number of sessions are issued
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
HSM error 8c: HSM Error: Already maximum number of sessions are issued
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
{"Timestamp":1604419875396468817,"Time":"2020-11-03T16:11:15Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":463,"Severity":2,"Fields":{"msg":"xpi: error generating RSA key for cache: \"pkcs11: 0x7: CKR_ARGUMENTS_BAD\""}}
CloudHSM doesn't have a documented PKCS11 session limit (or I couldn't find one), but https://docs.aws.amazon.com/cloudhsm/latest/userguide/limits.html does have "Maximum number of concurrent clients 900", and crypto11 defaults to 1024 sessions.
Indeed, setting maxsessions: 64 in the HSM part of the autograph config allowed autograph to start even after bumping the number of generators up to 255 (the limit imposed by the uint8 config type).
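A sketch of what that amounts to at the crypto11 layer, assuming the hsm section of the autograph config is passed through to crypto11's PKCS11Config and that our vendored copy exposes a MaxSessions knob (the field name and the other values here are my assumptions, not the real config):

package main

import (
    "log"

    "github.com/ThalesIgnite/crypto11"
)

func main() {
    // Cap the session pool well under CloudHSM's ~900 concurrent-client limit
    // instead of crypto11's default of 1024 sessions.
    if _, err := crypto11.Configure(&crypto11.PKCS11Config{
        Path:        "/opt/cloudhsm/lib/libcloudhsm_pkcs11.so",
        TokenLabel:  "cavium",        // placeholder
        Pin:         "user:password", // placeholder
        MaxSessions: 64,              // mirrors maxsessions: 64 in the autograph config
    }); err != nil {
        log.Fatal(err)
    }
}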
Comment 14•5 years ago
also tried looking at the HSM audit logs
Comment 15•5 years ago
So the fix is to add a more reasonable maxsessions to the autograph stage and prod app configs.
Additional links from debugging:
- https://github.com/ThalesIgnite/crypto11/issues/36
- https://github.com/ThalesIgnite/crypto11/releases/tag/v0.1.0
- https://github.com/ThalesIgnite/crypto11/pull/59
- https://godoc.org/gopkg.in/ThalesIgnite/crypto11.v0 (v1.x at https://pkg.go.dev/github.com/ThalesIgnite/crypto11#hdr-Sessions_and_concurrency)
- https://python-pkcs11.readthedocs.io/en/latest/applied.html#concepts-in-pkcs-11
Going forward it would be nice to:
- upgrade to a newer version of crypto11
- capture the additional logs from https://bugzilla.mozilla.org/show_bug.cgi?id=1641103#c13 -- however, those logs are coming from the cloudhsm library itself, so that doesn't look easy:
[gguthe@ip-172-31-22-191 ~]$ strings /opt/cloudhsm/lib/libcloudhsm_pkcs11.so | grep 'failed with error'
%s failed with error %s : 0x%08lx
Delete partition failed with error code [%d] !!
Resize partition failed with error code [%d] !!
Create partition failed with error code [%d]!!
Validation of template is failed with error 0x%x
[gguthe@ip-172-31-22-191 ~]$ strings /opt/cloudhsm/lib/libcloudhsm_pkcs11.so | grep 'HSM error'
HSM error %lx: %s
There are some outdated headers on https://github.com/aws-samples/aws-cloudhsm-pkcs11-examples/tree/master/include/pkcs11/v2.40, but we'd probably have to go to Cavium to find the source (if it's publicly available at all).
Comment 16•5 years ago
Update: the PR was closed without merging.
All work related to the HSM client libraries breaking autograph is really covered in bug 1672287. Once that is fixed, the PR can be recreated.
Comment 17•5 years ago
PR has been recreated.
Comment 18•5 years ago
Everything is landed and updated AFAIK.