Closed Bug 1082535 Opened 10 years ago Closed 10 years ago

Upload errors / Permission Denied

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: pmoore)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/574] )

we are seeing upload errors across trees and so i closed all trees:


https://treeherder.mozilla.org/ui/logviewer.html#?job_id=2987013&repo=mozilla-inbound
Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_rsa', 'ffxbld@stage.mozilla.org', 'mktemp -d'] returned non-zero exit code: 255 
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
i guess thats caused by bug 1061188
Backed out changes from bug 1061188, and running a reconfig, to recover.

Apologies, this was my change!
So Windows builders don't seem to have the ffxbld_rsa key - I will double check why.

cltbld@B-2008-IX-0157 ~
$ ls ~/.ssh
authorized_keys  b2gbld_dsa  config  environment  ffxbld_dsa  known_hosts  tbirdbld_dsa  xrbld_dsa

cltbld@B-2008-IX-0157 ~
$
I suspect the spot instances are based on a puppet image before the puppet change landed - and I think at the moment the only affected builders were linux spot instances and windows builders.
The failures that came in after the trees reopened were long-running builds (e.g. running for 2h or more) that still had the old version of e.g. mozharness.
As suspected, puppet change including ffxbld_rsa hadn't reached golden images yet:

[cltbld@bld-linux64-spot-119.build.releng.use1.mozilla.com ~]$ ls -l ~/.ssh/ffxbld*
-rw------- 1 cltbld cltbld 1679 Oct 13 01:19 /home/cltbld/.ssh/ffxbld_dsa
[cltbld@bld-linux64-spot-119.build.releng.use1.mozilla.com ~]$
(In reply to Pete Moore [:pete][:pmoore] from comment #62)
> So Windows builders don't seem to have the ffxbld_rsa key - I will double
> check why.
> 
> cltbld@B-2008-IX-0157 ~
> $ ls ~/.ssh
> authorized_keys  b2gbld_dsa  config  environment  ffxbld_dsa  known_hosts 
> tbirdbld_dsa  xrbld_dsa
> 
> cltbld@B-2008-IX-0157 ~
> $

See https://bugzilla.mozilla.org/show_bug.cgi?id=1061579#c4 - I guess we need to reimage all our Windows machines...
(In reply to Pete Moore [:pete][:pmoore] from comment #66)
> See https://bugzilla.mozilla.org/show_bug.cgi?id=1061579#c4 - I guess we
> need to reimage all our Windows machines...

GPO changes don't take effect until reboot, and with our extra b-2008-ix capacity, it's possible those machines haven't rebooted recently before taking a job.
Assignee: nobody → pmoore
It occurred to me that I don't need to necessarily reboot all windows slaves and spot instances - I could simply copy the ffxbld_dsa key to ffxbld_rsa on the slaves that haven't picked up from the new golden image or rebooted yet - and once all slaves have both keys, any future reimaging (GPO or new spot instances) should pick up the key automatically.

The only tricky thing is that it is a little hard to automate on the Windows machines, since we can't run remote commands over ssh with our current ssh daemon. For the spot instances, it should be relatively easy.
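The copy-if-missing step described above could look roughly like the following. This is only a sketch of the local file operation on one slave; the function name ensure_rsa_key and its parameters are made up for illustration, and it does not address how to run anything remotely on the Windows machines.

```python
import shutil
from pathlib import Path


def ensure_rsa_key(ssh_dir, src_name="ffxbld_dsa", dst_name="ffxbld_rsa"):
    """Copy src_name to dst_name inside ssh_dir unless dst_name already exists.

    Hypothetical helper: returns True if a copy was made, False if the
    destination key was already present (so it is safe to run repeatedly).
    """
    ssh_dir = Path(ssh_dir)
    src = ssh_dir / src_name
    dst = ssh_dir / dst_name

    if dst.exists():
        # Slave already picked up the key (new golden image or GPO refresh).
        return False
    if not src.exists():
        raise FileNotFoundError(f"{src} is missing; nothing to copy")

    # copy2 preserves the permission bits, so a 0600 key stays 0600.
    shutil.copy2(src, dst)
    return True
```

Calling ensure_rsa_key on a slave's ~/.ssh directory would be a no-op on machines that already have ffxbld_rsa, which is what makes it reasonable to run across a mixed pool.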
We do have some tools for scripting file uploads to windows slaves. See for example http://hg.mozilla.org/build/braindump/file/39b28705d819/utils/sftp.py
(In reply to Chris AtLee [:catlee] from comment #69)
> We do have some tools for scripting file uploads to windows slaves. See for
> example http://hg.mozilla.org/build/braindump/file/39b28705d819/utils/sftp.py

Awesome, thanks Catlee! I'll give it a spin.
We reimaged machines after the request to add the new key, so I'm curious why machines would have been missing this.  Mark, any ideas?
Flags: needinfo?(mcornmesser)
I'm currently uploading the new key using Nick's script that Catlee highlighted in comment 69 - so it should be on all the following 152 machines shortly:

b-2008-ix-0001
b-2008-ix-0002
b-2008-ix-0003
b-2008-ix-0004
b-2008-ix-0005
b-2008-ix-0006
b-2008-ix-0007
b-2008-ix-0008
b-2008-ix-0009
b-2008-ix-0010
b-2008-ix-0011
b-2008-ix-0012
b-2008-ix-0013
b-2008-ix-0014
b-2008-ix-0015
b-2008-ix-0016
b-2008-ix-0017
b-2008-ix-0065
b-2008-ix-0066
b-2008-ix-0067
b-2008-ix-0068
b-2008-ix-0069
b-2008-ix-0070
b-2008-ix-0071
b-2008-ix-0072
b-2008-ix-0073
b-2008-ix-0074
b-2008-ix-0075
b-2008-ix-0076
b-2008-ix-0077
b-2008-ix-0078
b-2008-ix-0079
b-2008-ix-0080
b-2008-ix-0082
b-2008-ix-0083
b-2008-ix-0084
b-2008-ix-0085
b-2008-ix-0086
b-2008-ix-0087
b-2008-ix-0088
b-2008-ix-0090
b-2008-ix-0091
b-2008-ix-0092
b-2008-ix-0093
b-2008-ix-0094
b-2008-ix-0095
b-2008-ix-0096
b-2008-ix-0098
b-2008-ix-0099
b-2008-ix-0100
b-2008-ix-0101
b-2008-ix-0102
b-2008-ix-0103
b-2008-ix-0104
b-2008-ix-0105
b-2008-ix-0106
b-2008-ix-0107
b-2008-ix-0108
b-2008-ix-0109
b-2008-ix-0110
b-2008-ix-0111
b-2008-ix-0112
b-2008-ix-0113
b-2008-ix-0114
b-2008-ix-0115
b-2008-ix-0116
b-2008-ix-0117
b-2008-ix-0118
b-2008-ix-0119
b-2008-ix-0120
b-2008-ix-0121
b-2008-ix-0122
b-2008-ix-0123
b-2008-ix-0124
b-2008-ix-0125
b-2008-ix-0126
b-2008-ix-0127
b-2008-ix-0128
b-2008-ix-0129
b-2008-ix-0130
b-2008-ix-0131
b-2008-ix-0132
b-2008-ix-0133
b-2008-ix-0134
b-2008-ix-0135
b-2008-ix-0136
b-2008-ix-0137
b-2008-ix-0138
b-2008-ix-0139
b-2008-ix-0140
b-2008-ix-0141
b-2008-ix-0142
b-2008-ix-0143
b-2008-ix-0144
b-2008-ix-0145
b-2008-ix-0146
b-2008-ix-0147
b-2008-ix-0148
b-2008-ix-0149
b-2008-ix-0150
b-2008-ix-0151
b-2008-ix-0152
b-2008-ix-0153
b-2008-ix-0154
b-2008-ix-0155
b-2008-ix-0156
b-2008-ix-0157
b-2008-ix-0158
b-2008-ix-0161
b-2008-ix-0162
b-2008-ix-0163
b-2008-ix-0164
b-2008-ix-0165
b-2008-ix-0166
b-2008-ix-0167
b-2008-ix-0168
b-2008-ix-0169
b-2008-ix-0170
b-2008-ix-0171
b-2008-ix-0172
b-2008-sm-0033
b-2008-sm-0034
b-2008-sm-0035
b-2008-sm-0036
b-2008-sm-0037
b-2008-sm-0038
b-2008-sm-0039
b-2008-sm-0040
b-2008-sm-0041
b-2008-sm-0042
b-2008-sm-0043
b-2008-sm-0044
b-2008-sm-0045
b-2008-sm-0046
b-2008-sm-0047
b-2008-sm-0048
b-2008-sm-0049
b-2008-sm-0050
b-2008-sm-0051
b-2008-sm-0052
b-2008-sm-0053
b-2008-sm-0054
b-2008-sm-0055
b-2008-sm-0056
b-2008-sm-0057
b-2008-sm-0058
b-2008-sm-0059
b-2008-sm-0060
b-2008-sm-0061
b-2008-sm-0062
b-2008-sm-0063
b-2008-sm-0064

I got this list from https://secure.pub.build.mozilla.org/slavealloc/api/slaves filtering by

if s['distro'] == 'win2k8' and s['environment'] == 'prod' and s['trustid'] == 5:

Amy, does that look like the correct filtering to you? I see it is picking up seamicro machines, but I guess this does no harm. I didn't want to omit slavealloc-disabled slaves, since if they get reenabled, they would then be missing the key.
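For reference, the filter above can be written as a small standalone function. This is a sketch only: the 'distro', 'environment' and 'trustid' keys come from the filter quoted above, while the 'name' and 'enabled' keys and the sample records are assumptions made for illustration, not a statement of what the slavealloc API returns.

```python
def select_windows_prod_builders(slaves):
    """Return sorted names of slaves that match the filter quoted above.

    Disabled slaves are deliberately NOT excluded, so that a slave that is
    re-enabled later does not end up missing the key.
    """
    return sorted(
        s["name"]  # 'name' is an assumed field
        for s in slaves
        if s.get("distro") == "win2k8"
        and s.get("environment") == "prod"
        and s.get("trustid") == 5
    )


# Hypothetical records, shaped like a parsed JSON response.
sample = [
    {"name": "b-2008-ix-0001", "distro": "win2k8",
     "environment": "prod", "trustid": 5, "enabled": True},
    {"name": "b-2008-sm-0033", "distro": "win2k8",
     "environment": "prod", "trustid": 5, "enabled": False},
    {"name": "b-2008-ix-0097", "distro": "win2k8",
     "environment": "preprod", "trustid": 5, "enabled": True},
    {"name": "bld-linux64-spot-119", "distro": "some-linux-distro",
     "environment": "prod", "trustid": 5, "enabled": True},
]

print(select_windows_prod_builders(sample))
```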
Flags: needinfo?(arich)
(In reply to Pete Moore [:pete][:pmoore] from comment #72)
> I'm currently uploading the new key using Nick's script

Correction, Catlee's script!!

pmoore@Elisandra:~/hg/braindump/utils $ hg log sftp.py 
changeset:   214:e1b9c2c7d28d
user:        Chris AtLee <catlee@mozilla.com>
date:        Fri Jun 21 09:46:55 2013 -0400
summary:     Adding sftp utility, which works on windows!!!111!11! \o/
As a systems person, I take an inventory approach to this rather than using data from slaveapi. My search would look like: https://inventory.mozilla.org/en-US/core/search/#q=winbuild.releng.scl3.mozilla.com&System

(everything in winbuild.releng.scl3.mozilla.com).

There may be some differences in there if there are machines loaned out, etc, since we nuke the keys from those explicitly so we don't have any breaches of security. I would suggest making sure that you do NOT copy the key out to any loaners.
Flags: needinfo?(arich)
I also have a larger concern about why the key is not there in the first place. If we create new machines that are not getting the appropriate key, we're going to keep running into this problem again and again. That's why I needinfoed Mark, to make sure that we're doing the right thing at installation time.
Thanks Amy.

I agree, and will hold off rolling out any changes until we are sure the imaging process is working correctly to avoid regressions.

The inventory list matches the slavealloc one, with the exceptions:

b-2008-ix-0081: loaned to dustin in https://bugzilla.mozilla.org/show_bug.cgi?id=1066164
b-2008-ix-0097: loaned and returned in https://bugzilla.mozilla.org/show_bug.cgi?id=819366 but marked as in preprod pool in slavealloc

So the list looks good.
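The comparison between the two lists amounts to a set difference on short hostnames. A minimal sketch, assuming both lists are already available as plain lists of hostnames (the function name and the sample data are hypothetical):

```python
def compare_host_lists(inventory_hosts, slavealloc_hosts):
    """Return (only_in_inventory, only_in_slavealloc) as sorted lists.

    Hostnames are reduced to their short form (text before the first dot)
    and lowercased so that FQDNs and short names compare equal.
    """
    inventory = {h.split(".")[0].lower() for h in inventory_hosts}
    slavealloc = {h.split(".")[0].lower() for h in slavealloc_hosts}
    return sorted(inventory - slavealloc), sorted(slavealloc - inventory)


# Hypothetical, heavily shortened inputs.
inventory_hosts = [
    "b-2008-ix-0001.winbuild.releng.scl3.mozilla.com",
    "b-2008-ix-0081.winbuild.releng.scl3.mozilla.com",
    "b-2008-ix-0097.winbuild.releng.scl3.mozilla.com",
]
slavealloc_hosts = ["b-2008-ix-0001"]

only_inventory, only_slavealloc = compare_host_lists(inventory_hosts, slavealloc_hosts)
print(only_inventory)
print(only_slavealloc)
```

Anything that shows up only on the inventory side is worth checking by hand before an upload, since it may be a loaner that must not receive the key.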
Maybe worth us reimaging one machine (e.g. moving b-2008-ix-0097 back to prod pool and reimaging) to see if it picks up the new key?

If that works, we should be good going forward, and might save some teeth-pulling in diagnosing why the reimaging didn't pick up the changes?

Please note, several of these machines may not have rebooted - so if the reimaging was just updating the available GPO image for the machines, and not actually rebooting them, that could be the reason.
Flags: needinfo?(arich)
Reimaging means completely reinstalling the machine from scratch, not applying a GPO (which happens at every boot). I'll let Mark take a look and decide if he wants to reimage a machine to see if it picks up the correct key.
Flags: needinfo?(arich)
I am going to go ahead and disable b-2008-ix-0097 and take a look into what's happening here. The GPO for the key is a file copy, so a full re-image should not be needed.
Flags: needinfo?(mcornmesser)
I found what happened here. I dropped the file into a directory the GPO was using previously. When the try and build OUs were created, there were new directories for the .ssh files that reflect the OUs. I moved the file over to the correct directory, and it is now showing up on b-2008-ix-0097. Because this is a file copy and not an application install, it should be going out to the build machines within an hour or so.
Severity: blocker → normal
[16:17:11]	nthomas	does https://bugzilla.mozilla.org/show_bug.cgi?id=1082535#c80 mean we're doing GPO tasks not at boot time ?!??
Flags: needinfo?(mcornmesser)
It does at boot as well. 

A client will check with the server periodically for updates, as well as at boot. When it checks for updates and picks up a task such as a file copy, it will begin executing the task. Other tasks, such as a package install, a security change, a registry change, and others, will not take effect or start until the next boot.
Flags: needinfo?(mcornmesser)
(In reply to Mark Cornmesser [:markco] from comment #82)
> It does at boot as well. 
> 
> A client will check with the server periodically for updates, as well as
> at boot. When it checks for updates and picks up a task such as a file
> copy, it will begin executing the task. Other tasks, such as a package
> install, a security change, a registry change, and others, will not take
> effect or start until the next boot.

That's...unexpected.

Generally we don't want puppet or GPO services to run while machines are doing builds and/or tests. The possible slowdown could cause a machine to unexpectedly timeout during routine activity, and invoke a flurry of activity from sheriffs, buildduty, and devs as sheriffs back-out patches and close trees.

Is there a way to *only* run GPO changes on reboot? I think that's how everyone in releng thought the system was already working.
A little more information. When Group Policy updates, it is all done through a background process, and this should not affect foreground processes.

We could disable the refresh interval. The result would be that Group Policy refreshes would occur on log in / log out, or for our purposes during shutdown and after boot.

Should we open a different bug to discuss this? Also let's pull in Arr and Q into the conversation.
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/574]
OK for me to close this coop now? The only outstanding matter is comment 83/84 - if we want a separate bug for that?
Flags: needinfo?(coop)
(In reply to Pete Moore [:pete][:pmoore] from comment #85)
> OK for me to close this coop now? The only outstanding matter is comment
> 83/84 - if we want a separate bug for that?

Yes, you can close it if we track the GPO issue elsewhere. Might be a bug on file already.
Flags: needinfo?(coop)
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81854534
See Also: → 1093072
(In reply to Chris Cooper [:coop] from comment #86)
> (In reply to Pete Moore [:pete][:pmoore] from comment #85)
> > OK for me to close this coop now? The only outstanding matter is comment
> > 83/84 - if we want a separate bug for that?
> 
> Yes, you can close it if we track the GPO issue elsewhere. Might be a bug on
> file already.

Created bug 1093072 for this.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
See Also: → 1086915
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/574] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/574]
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard