Closed Bug 1472860 Opened 6 years ago Closed 6 years ago

enable mdc1+mdc2 signing servers

Categories

(Release Engineering :: Release Automation: Signing, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: mozilla)

References

Details

(Whiteboard: [stockwell disable-recommended])

Attachments

(3 files)

We have these in DNS, and I believe they've been spun up / moved.
We need to make sure they work, and enable them.
Related: bug 1374787 - dep signing servers. Not a blocker, but if we're doing work here, we could also do this.
See Also: → 1374787
I was able to get depsigning to work from depsigning-worker1 to signing7 :)
The only real snag was widevine signing, which bailed on "bad passphrase" until I updated the tools clone.
Should be easy enough to get the various linux instances up in mdc1 and added to the signing passwords list.

I still need to do that, plus get the various mac signing servers up (bug 1403674).

Dave, is mdc2 ready yet, or should I hold off?
Flags: needinfo?(dhouse)
Aki, please go ahead with mdc2. mdc2 linux signing servers were set up through bug 1443291.
mdc2
signing10 10.51.48.29
signing11 10.51.48.30
signing12 10.51.48.31
Flags: needinfo?(dhouse)
Looks like we need a new ssl cert for the mdc2 servers.

aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host signing10.srv.releng.mdc2.mozilla.com:9110 ssl:True [CertificateError: ("hostname 'signing10.srv.releng.mdc2.mozilla.com' doesn't match either of 'signing4.srv.releng.scl3.mozilla.com', 'signing5.srv.releng.scl3.mozilla.com', 'signing6.srv.releng.scl3.mozilla.com', 'mac-v2-signing1.srv.releng.scl3.mozilla.com', 'mac-v2-signing2.srv.releng.scl3.mozilla.com', 'mac-v2-signing3.srv.releng.scl3.mozilla.com', 'mac-v2-signing4.srv.releng.scl3.mozilla.com', 'mac-v2-signing6.srv.releng.scl3.mozilla.com', 'mac-v2-signing7.srv.releng.scl3.mozilla.com'",)]
Also, mac-v2-signing13 probably needs puppetizing; I got a password prompt when trying to ssh.
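For reference, the CertificateError above comes down to the mdc2 hostname not appearing in the cert's name list. A minimal sketch of that check, assuming a simplified matching rule (exact match, or a single leading wildcard label); `hostname_matches` and `SCL3_CERT_NAMES` are hypothetical names for illustration, not part of signingscript or aiohttp:

```python
def hostname_matches(hostname, cert_names):
    """Return True if hostname matches any DNS name listed in a certificate.

    Simplified rule (an assumption for this sketch): case-insensitive exact
    match, or a '*.' wildcard that covers exactly one left-most label.
    """
    host = hostname.lower().rstrip(".")
    for name in cert_names:
        name = name.lower().rstrip(".")
        if name == host:
            return True
        if name.startswith("*."):
            # Drop the left-most label of the host and compare the remainder.
            _, _, host_rest = host.partition(".")
            if host_rest and host_rest == name[2:]:
                return True
    return False


# A few of the names from the error message above (list abbreviated).
SCL3_CERT_NAMES = [
    "signing4.srv.releng.scl3.mozilla.com",
    "signing5.srv.releng.scl3.mozilla.com",
    "signing6.srv.releng.scl3.mozilla.com",
]

if __name__ == "__main__":
    for host in (
        "signing4.srv.releng.scl3.mozilla.com",
        "signing10.srv.releng.mdc2.mozilla.com",
    ):
        print(host, hostname_matches(host, SCL3_CERT_NAMES))
```

Running the same comparison against the name list of a replacement cert would show whether it covers the mdc2 hosts before the cert is rolled out.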
Depends on: 1477139
I have my environment set up on releng-puppet2 in mdc1, but it appears to be reading the system hiera file instead of my env's. Once I clear that up I can proceed with testing.
Attached patch: tools-ssl
Tools equivalent to https://github.com/mozilla-releng/signingscript/pull/54 .

- removes the old mdc1 cert (currently not used in production)
- adds a new scl3 cert. This will last past Aug 30; we can remove it when it's no longer needed. Even if we retire the scl3 signing servers earlier, this should avoid the 60 day expiration warning.
- adds a new non-scl3 cert. We can keep this one until summer next year.
Assignee: nobody → aki
Attachment #8994299 - Flags: review?(catlee)
Attachment #8994302 - Flags: review?(catlee)
Attached patch: puppet mdc1
We should be able to enable mdc2 once we test, spin up, and resolve bug 1477139 -- the new cert includes the mdc2 hosts.
Attachment #8994303 - Flags: review?(catlee)
Attachment #8994299 - Flags: review?(catlee) → review+
Attachment #8994302 - Attachment is patch: true
Attachment #8994302 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8994302 - Flags: review?(catlee) → review+
Attachment #8994303 - Attachment is patch: true
Attachment #8994303 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8994303 - Flags: review?(catlee) → review+
Fixed mac-depsigning2 by running `export SDKROOT=macosx10.10` before puppetizing.
New ssl certs are rolled out, and trees appear green.
To do:

- spin up mdc2 signing servers
- test mdc2 signing servers
- enable mdc2 signing servers
- verify there are no lingering issues
I had to disable the mdc1 servers [1] because attempts to connect to request a token were hanging.
My current theory is that the routes are wrong: we may be going from use1/usw2 through scl3, which would make the request look like the scl3 exit node, which might not be allowlisted. (That's a lot of guesses =\ )

I'm not sure why my one-off tests worked, but we need to resolve this before we can re-enable. (I'm thinking about going through all the failures to try to see a pattern -- maybe only one of the two AWS regions was failing? or something?)
Dave, do you have a better idea what's going on, or know who would? I'm going to keep poking at this, but I'd welcome another pair of eyes here.

[1] https://github.com/mozilla-releng/build-puppet/commit/397edbc9f6a88ab6244f24aa213aaa8385690bb7
Flags: needinfo?(dhouse)
(In reply to Aki Sasaki [:aki] from comment #13)
> I'm not sure why my one-off tests worked, but we need to resolve this before
> we can re-enable. (I'm thinking about going through all the failures to try
> to see a pattern -- maybe only one of the two AWS regions was failing? or
> something?)

I see both use1 and usw2, both depsigning and prod signing in the failures. I do see some token requests in signing9's depsigning log, as well as some signing requests, so some requests got through. Not seeing a pattern yet.

I'm going to look at the various routing rules in AWS and compare the scl3 vs mdc1 rules.
(In reply to Aki Sasaki [:aki] from comment #14)
> (In reply to Aki Sasaki [:aki] from comment #13)
> > I'm not sure why my one-off tests worked, but we need to resolve this before
> > we can re-enable. (I'm thinking about going through all the failures to try
> > to see a pattern -- maybe only one of the two AWS regions was failing? or
> > something?)
> 
> I see both use1 and usw2, both depsigning and prod signing in the failures.
> I do see some token requests in signing9's depsigning log, as well as some
> signing requests, so some requests got through. Not seeing a pattern yet.

It looks like at least one job succeeded by a) getting its token from a non-mdc1 signing server, and then b) successfully getting a signature from an mdc1 server. The problem is directly related to the token request hanging.
Wondering if we need an mdc1 -> us{e1,w2} vpc ?
On depsigning-worker15.srv.releng.use1.mozilla.com,

- `sudo mtr signing7.srv.releng.mdc1.mozilla.com` gives:

Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                             Packets               Pings
 Host                                                      Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 169.254.255.109                                         0.0%     5    0.5   1.5   0.5   5.7   2.3
 2. 169.254.255.1                                           0.0%     4   13.0   4.6   1.7  13.0   5.6
 3. 169.254.255.2                                           0.0%     4   77.6  77.6  77.6  77.7   0.1
 4. signing7.srv.releng.mdc1.mozilla.com                    0.0%     4   77.8  78.1  77.8  78.9   0.5

- `nc -vz signing7.srv.releng.mdc1.mozilla.com 9110` hangs
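The `nc -vz` check hangs rather than failing fast, which makes it awkward to run across many worker/server pairs. A rough Python equivalent with an explicit timeout might look like the sketch below; `can_connect` is a hypothetical helper, and the 5 second default is an arbitrary assumption:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Refused connections, DNS failures, and timeouts all raise OSError
    subclasses, so they are reported as False instead of hanging.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port taken from the nc command above.
    print(can_connect("signing7.srv.releng.mdc1.mozilla.com", 9110))
```

A `False` here only says the TCP handshake did not complete in time; it does not distinguish a routing problem from a firewall or allowlist drop.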
(In reply to Aki Sasaki [:aki] from comment #13)
> I had to disable the mdc1 servers [1] because attempts to connect to request
> a token were hanging.
> My current theory is that the routes are wrong: we may be going from
> use1/usw2 through scl3, which would make the request look like the scl3 exit
> node, which might not be allowlisted. (That's a lot of guesses =\ )
> 
> I'm not sure why my one-off tests worked, but we need to resolve this before
> we can re-enable. (I'm thinking about going through all the failures to try
> to see a pattern -- maybe only one of the two AWS regions was failing? or
> something?)
> Dave, do you have a better idea what's going on, or know who would? I'm
> going to keep poking at this, but I'd welcome another pair of eyes here.
> 
> [1]
> https://github.com/mozilla-releng/build-puppet/commit/
> 397edbc9f6a88ab6244f24aa213aaa8385690bb7

Aki, I'm sorry I didn't check into this earlier. I missed the NI. I'll check on the status for bug 1478868 to help out a little.
Flags: needinfo?(dhouse)
Dave, I think we're good now, thanks :) Sorry, I should have cleared the ni earlier.
mdc1 servers are live. I'm going to do a little more testing with mdc2 and then bring those live, skipping mac-v2-signing13 until bug 1477139 is resolved.
Depends on: 1480512
I've requested QTS reimage mac-depsigning6.srv.releng.mdc2.mozilla.com through REQ0240274 (using recovery console, and doing a bless and reboot: `/usr/sbin/bless --netboot --server bsdp://10.51.56.16; reboot`).
(In reply to Dave House [:dhouse] from comment #23)
> I've requested QTS reimage mac-depsigning6.srv.releng.mdc2.mozilla.com
> through REQ0240274 (using recovery console, and doing a bless and reboot:
> `/usr/sbin/bless --netboot --server bsdp://10.51.56.16; reboot`).

QTS closed my request as completed. I'll follow up in bug 1480512
Currently:

- linux signing servers in mdc1 are live
- mdc2 is not live
- no mac mdc? servers are live
- there's a signingscript bug that makes token requests more fragile than they need to be (we don't catch network issues when requesting a token, so if we choose a broken server we'll fail out rather than try another server). I have a fix in the autograph PR, but we can and should land that fix earlier.
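
To illustrate the token-request fix described in the last item, here is a minimal sketch of the idea, not the actual signingscript code. `get_token` and the `request_token` coroutine are hypothetical names, and the sketch assumes network failures surface as `OSError` subclasses or timeouts:

```python
import asyncio
import random


async def get_token(servers, request_token, timeout=10.0):
    """Try each server in random order until one returns a token.

    request_token is a caller-supplied coroutine function taking a server
    and returning a token (hypothetical interface for this sketch).
    """
    candidates = list(servers)
    random.shuffle(candidates)
    last_error = None
    for server in candidates:
        try:
            # wait_for bounds a hanging request so we can move on.
            return await asyncio.wait_for(request_token(server), timeout)
        except (OSError, asyncio.TimeoutError) as exc:
            # Network-level failure: remember it and try the next server.
            last_error = exc
    raise RuntimeError("no signing server returned a token") from last_error
```

With something like this in place, a single hung or unreachable server costs one timeout instead of failing the whole job.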

see bug 1480512 for the mdc macs that still need help (9+13).

Ben figured out how to fix the mac signing servers without vnc!
I've spun up mac{8,10,11,12}, with testing of all 3 ports. So they're up, but nothing's pointing at them yet.

To do:

- I will update the docs.
- I'll submit a PR to catch the token request network exceptions.
- Depending on how long 9 and 13 take to fix, we can either bring up just the 4 mac signing servers listed above, or bring up all of them once the signingscript token fix lands, knowing that we'll retry whenever we hit a disabled server.
- I should also test the linux mdc2 signing servers, and bring those up. I had dep signing working in prod on a single depsigning worker, but spot testing the nightly and release ports may be wise with all the bustage we've seen rolling these out.

Once the above is done and the prod signing servers in mdc{1,2} are up, we can resolve this bug.
Depends on: 1481264
https://github.com/mozilla-releng/build-puppet/pull/159

I think we're done here.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED