Bug 1641103 Comment 13 Edit History
When the XPI signers and their creds are commented out, stage runs and the monitor passes (it doesn't check APK2, gpg, or gpg2 signatures), which is consistent with this part of [the transcript Hal saved in comment 11](https://bugzilla.mozilla.org/attachment.cgi?id=9183488): `it's failing at ./signer/xpi/x509.go:33 which is trying to call new(crypto11.PKCS11RandReader)`

So we have a few options:

1. disable use of the HSM RNG in favor of the golang RNG (currently if the key is in the HSM autograph will try to use the HSM RNG for that signer; edit: that's not right), possibly with XPI EE cert reuse https://github.com/mozilla-services/autograph/issues/237
1. generate and clean up XPI EE certs in the HSM (might run into resource limitations; requires giving autograph user perms to create keys in the HSM, complicates HSM management)
1. debug and fix the pkcs11 error

I'm going to try 3. / debugging it but we can always fall back to 1. / removing the HSM RNG if that fails and we're running out of time.

NB: OpenSSL dynamic engine does not support offloading RNG without opening a support case https://docs.aws.amazon.com/cloudhsm/latest/userguide/ki-openssl-sdk.html#ki-openssl-2 but that doesn't seem to be the case for PKCS11.

To debug the RNG we'll want to check each layer of the call stack:

* https://github.com/ThalesIgnite/crypto11 provides the golang `crypto.Signer` and other language standard interfaces
* https://github.com/miekg/pkcs11 low-level wrapper of the PKCS11 C API
* https://cryptsoft.com/pkcs11doc/v220/pkcs11__all_8h.html#aC_GenerateRandom C_GenerateRandom the main C call we're interested in

Per https://github.com/ThalesIgnite/crypto11#testing-with-aws-cloudhsm it was only tested against the CloudHSM 2 libraries, so that might be the source of the errors.
We're using [a rather old version v0.1.0](https://github.com/ThalesIgnite/crypto11/tree/v1.1.0) of ThalesIgnite/crypto11 and upgrading to any newer release 1+ would introduce [breaking changes](https://github.com/ThalesIgnite/crypto11/issues/36) including losing access to the miekg/pkcs11 context that we use to initialize the HSM RNG.

Running the `ThalesIgnite/crypto11@0.1.0` rand test (clone their repo, checkout 0.1.0, update the test `config` file with the stage HSM config) results in this error which looks more helpful:

```console
go test -run TestRandomReader
2020/10/29 19:51:50 Could decode config file: invalid character 'c' after object key
--- FAIL: TestRandomReader (0.00s)
    require.go:794:
        Error Trace:    rand_test.go:34
        Error:          Received unexpected error:
                        invalid character 'c' after object key
        Test:           TestRandomReader
FAIL
exit status 1
FAIL    github.com/ThalesIgnite/crypto11    0.004s
```

Never mind, that error is from crypto11 trying to configure itself:

* Using `crypto11.Configure` like autograph does causes the test to pass
* tests still pass with `new(PKCS11RandReader)` and multiple calls to it

So it's probably something autograph specific, and debugging the calls it makes with the failing config might be more useful.

The errors from cache keygen in goroutines seem like a problem. With one XPI signer and one XPI RSA key generator the app started most of the time, but increasing the generators to 30 (increasing the number of XPI signers would probably cause similar problems) caused it to fail with `CKR_ARGUMENTS_BAD` every time I tried.
I also saw an error from fetching a key:

```
{"Timestamp":1604005730281323793,"Time":"2020-10-29T21:08:50Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":1,"Severity":2,"Fields":{"msg":"failed to add signer \"normandy\": contentsignaturepki \"normandy\": failed to get keys: pkcs11: 0x7: CKR_ARGUMENTS_BAD"}}
```

so I think this is related to concurrent use of the pkcs11 context and not anything particular to the RNG (we see that call fail the most since it's the most frequent call on boot). This might explain the HSM disconnect and intermittent 502s in bug 1563796 (and other bugs) we saw earlier (app fails to start on boot and the ASG can't scale up properly) and why :bobm consolidating onto fewer permanent larger nodes fixed it.

This could also be:

> Very rapid session opening can trigger the following error:
>
> C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
> HSM error 8c: HSM Error: Already maximum number of sessions are issued

from https://github.com/ThalesIgnite/crypto11#testing-with-aws-cloudhsm but we don't have the full error message. If that were the case, we'd expect almost all failures past a certain threshold.

This could actually be the issue: running in a container with 30 generators and without redirecting to stdout, I see:

```
$ autograph -c config.yaml
...
{"Timestamp":1604419874512978880,"Time":"2020-11-03T16:11:14Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":463,"Severity":6,"Fields":{"msg":"contentsignaturepki \"onecrl\": reusing existing EE \"onecrl-20201026205928\""}}
HSM error 8c: HSM Error: Already maximum number of sessions are issued
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
HSM error 8c: HSM Error: Already maximum number of sessions are issued
C_OpenSession failed with error CKR_ARGUMENTS_BAD : 0x00000007
{"Timestamp":1604419875396468817,"Time":"2020-11-03T16:11:15Z","Type":"app.log","Logger":"autograph","Hostname":"ip-172-31-22-191","EnvVersion":"2.0","Pid":463,"Severity":2,"Fields":{"msg":"xpi: error generating RSA key for cache: \"pkcs11: 0x7: CKR_ARGUMENTS_BAD\""}}
```

CloudHSM doesn't have a documented PKCS11 session limit (or I couldn't find one), but https://docs.aws.amazon.com/cloudhsm/latest/userguide/limits.html does have "Maximum number of concurrent clients 900" and [crypto11 defaults to 1024 sessions](https://github.com/ThalesIgnite/crypto11/blob/c73933259cb60509d00f32306eea53d10f8e8f10/crypto11.go#L93-L95).

Indeed, setting `maxsessions: 64` in the HSM part of the autograph config allowed autograph to start even after bumping up the number of generators to 255 (the maximum imposed by [the uint8 config type](https://github.com/mozilla-services/autograph/blob/master/signer/signer.go#L46-L48)).
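To illustrate why capping sessions on the client side lets 255 generators start, here is a minimal sketch of a bounded session pool. It is a toy model, not crypto11's actual pool: `fakeHSM`, `runGenerators`, and the backend limit of 64 are all hypothetical (the real CloudHSM limit is not documented, as noted above).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSession mimics the "maximum number of sessions" failure seen above.
var errNoSession = errors.New("already maximum number of sessions are issued")

// fakeHSM is a hypothetical backend that rejects opens past its limit.
type fakeHSM struct {
	mu    sync.Mutex
	open  int
	limit int
}

func (h *fakeHSM) openSession() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.open >= h.limit {
		return errNoSession
	}
	h.open++
	return nil
}

func (h *fakeHSM) closeSession() {
	h.mu.Lock()
	h.open--
	h.mu.Unlock()
}

// runGenerators starts n workers that each need one session. A buffered
// channel of size maxSessions acts as a semaphore, so at most maxSessions
// workers hold a session at any time. It returns how many workers failed.
func runGenerators(h *fakeHSM, n, maxSessions int) int {
	sem := make(chan struct{}, maxSessions)
	var wg sync.WaitGroup
	var mu sync.Mutex
	failures := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // block until a slot is free
			defer func() { <-sem }() // free the slot after the session closes
			if err := h.openSession(); err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
				return
			}
			defer h.closeSession()
			// key generation would happen here
		}()
	}
	wg.Wait()
	return failures
}

func main() {
	h := &fakeHSM{limit: 64}
	fmt.Println("failures with 255 generators, cap 64:", runGenerators(h, 255, 64))
}
```

With the cap at or below the backend limit, extra generators wait for a free slot instead of failing. With a cap above the limit (like a 1024 default against a smaller backend limit), nothing stops workers from hitting the backend error.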