Closed
Bug 1082535
Opened 11 years ago
Closed 10 years ago
Upload errors / Permission Denied
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: pmoore)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/574] )
we are seeing upload errors across trees and so i closed all trees:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=2987013&repo=mozilla-inbound
Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_rsa', 'ffxbld@stage.mozilla.org', 'mktemp -d'] returned non-zero exit code: 255
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Reporter | ||
Comment 1•11 years ago
|
||
i guess that's caused by bug 1061188
Assignee | ||
Comment 2•11 years ago
|
||
Backed out changes from bug 1061188, and running a reconfig, to recover.
Apologies, this was my change!
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 62•11 years ago
|
||
So Windows builders don't seem to have the ffxbld_rsa key - I will double check why.
cltbld@B-2008-IX-0157 ~
$ ls ~/.ssh
authorized_keys b2gbld_dsa config environment ffxbld_dsa known_hosts tbirdbld_dsa xrbld_dsa
cltbld@B-2008-IX-0157 ~
$
Assignee | ||
Comment 63•11 years ago
|
||
I suspect the spot instances are based on a puppet image before the puppet change landed - and I think at the moment the only affected builders were linux spot instances and windows builders.
Assignee | ||
Comment 64•11 years ago
|
||
The failures that came in after the trees reopened were long-running builds (e.g. running for 2h or more) that still had the old version of e.g. mozharness.
Assignee | ||
Comment 65•11 years ago
|
||
As suspected, puppet change including ffxbld_rsa hadn't reached golden images yet:
[cltbld@bld-linux64-spot-119.build.releng.use1.mozilla.com ~]$ ls -l ~/.ssh/ffxbld*
-rw------- 1 cltbld cltbld 1679 Oct 13 01:19 /home/cltbld/.ssh/ffxbld_dsa
[cltbld@bld-linux64-spot-119.build.releng.use1.mozilla.com ~]$
Assignee | ||
Comment 66•11 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #62)
> So Windows builders don't seem to have the ffxbld_rsa key - I will double
> check why.
>
> cltbld@B-2008-IX-0157 ~
> $ ls ~/.ssh
> authorized_keys b2gbld_dsa config environment ffxbld_dsa known_hosts
> tbirdbld_dsa xrbld_dsa
>
> cltbld@B-2008-IX-0157 ~
> $
See https://bugzilla.mozilla.org/show_bug.cgi?id=1061579#c4 - I guess we need to reimage all our Windows machines...
Comment 67•11 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #66)
> See https://bugzilla.mozilla.org/show_bug.cgi?id=1061579#c4 - I guess we
> need to reimage all our Windows machines...
GPO changes don't take effect until reboot, and with our extra b-2008-ix capacity, it's possible those machines haven't rebooted recently before taking a job.
Assignee: nobody → pmoore
Assignee | ||
Comment 68•11 years ago
|
||
It occurred to me that I don't need to necessarily reboot all windows slaves and spot instances - I could simply copy the ffxbld_dsa key to ffxbld_rsa on the slaves that haven't picked up from the new golden image or rebooted yet - and once all slaves have both keys, any future reimaging (GPO or new spot instances) should pick up the key automatically.
The only tricky thing is that this is a little hard to automate on the windows machines, since we can't run remote commands over ssh with our current ssh daemon. For the spot instances, it should be relatively easy.
Comment 69•11 years ago
|
||
We do have some tools for scripting file uploads to windows slaves. See for example http://hg.mozilla.org/build/braindump/file/39b28705d819/utils/sftp.py
Assignee | ||
Comment 70•11 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #69)
> We do have some tools for scripting file uploads to windows slaves. See for
> example http://hg.mozilla.org/build/braindump/file/39b28705d819/utils/sftp.py
Awesome, thanks Catlee! I'll give it a spin.
Comment 71•11 years ago
|
||
We reimaged machines after the request to add the new key, so I'm curious why machines would have been missing this. Mark, any ideas?
Flags: needinfo?(mcornmesser)
Assignee | ||
Comment 72•11 years ago
|
||
I'm currently uploading the new key using Nick's script that Catlee highlighted in comment 69 - so it should be on all the following 152 machines shortly:
b-2008-ix-0001
b-2008-ix-0002
b-2008-ix-0003
b-2008-ix-0004
b-2008-ix-0005
b-2008-ix-0006
b-2008-ix-0007
b-2008-ix-0008
b-2008-ix-0009
b-2008-ix-0010
b-2008-ix-0011
b-2008-ix-0012
b-2008-ix-0013
b-2008-ix-0014
b-2008-ix-0015
b-2008-ix-0016
b-2008-ix-0017
b-2008-ix-0065
b-2008-ix-0066
b-2008-ix-0067
b-2008-ix-0068
b-2008-ix-0069
b-2008-ix-0070
b-2008-ix-0071
b-2008-ix-0072
b-2008-ix-0073
b-2008-ix-0074
b-2008-ix-0075
b-2008-ix-0076
b-2008-ix-0077
b-2008-ix-0078
b-2008-ix-0079
b-2008-ix-0080
b-2008-ix-0082
b-2008-ix-0083
b-2008-ix-0084
b-2008-ix-0085
b-2008-ix-0086
b-2008-ix-0087
b-2008-ix-0088
b-2008-ix-0090
b-2008-ix-0091
b-2008-ix-0092
b-2008-ix-0093
b-2008-ix-0094
b-2008-ix-0095
b-2008-ix-0096
b-2008-ix-0098
b-2008-ix-0099
b-2008-ix-0100
b-2008-ix-0101
b-2008-ix-0102
b-2008-ix-0103
b-2008-ix-0104
b-2008-ix-0105
b-2008-ix-0106
b-2008-ix-0107
b-2008-ix-0108
b-2008-ix-0109
b-2008-ix-0110
b-2008-ix-0111
b-2008-ix-0112
b-2008-ix-0113
b-2008-ix-0114
b-2008-ix-0115
b-2008-ix-0116
b-2008-ix-0117
b-2008-ix-0118
b-2008-ix-0119
b-2008-ix-0120
b-2008-ix-0121
b-2008-ix-0122
b-2008-ix-0123
b-2008-ix-0124
b-2008-ix-0125
b-2008-ix-0126
b-2008-ix-0127
b-2008-ix-0128
b-2008-ix-0129
b-2008-ix-0130
b-2008-ix-0131
b-2008-ix-0132
b-2008-ix-0133
b-2008-ix-0134
b-2008-ix-0135
b-2008-ix-0136
b-2008-ix-0137
b-2008-ix-0138
b-2008-ix-0139
b-2008-ix-0140
b-2008-ix-0141
b-2008-ix-0142
b-2008-ix-0143
b-2008-ix-0144
b-2008-ix-0145
b-2008-ix-0146
b-2008-ix-0147
b-2008-ix-0148
b-2008-ix-0149
b-2008-ix-0150
b-2008-ix-0151
b-2008-ix-0152
b-2008-ix-0153
b-2008-ix-0154
b-2008-ix-0155
b-2008-ix-0156
b-2008-ix-0157
b-2008-ix-0158
b-2008-ix-0161
b-2008-ix-0162
b-2008-ix-0163
b-2008-ix-0164
b-2008-ix-0165
b-2008-ix-0166
b-2008-ix-0167
b-2008-ix-0168
b-2008-ix-0169
b-2008-ix-0170
b-2008-ix-0171
b-2008-ix-0172
b-2008-sm-0033
b-2008-sm-0034
b-2008-sm-0035
b-2008-sm-0036
b-2008-sm-0037
b-2008-sm-0038
b-2008-sm-0039
b-2008-sm-0040
b-2008-sm-0041
b-2008-sm-0042
b-2008-sm-0043
b-2008-sm-0044
b-2008-sm-0045
b-2008-sm-0046
b-2008-sm-0047
b-2008-sm-0048
b-2008-sm-0049
b-2008-sm-0050
b-2008-sm-0051
b-2008-sm-0052
b-2008-sm-0053
b-2008-sm-0054
b-2008-sm-0055
b-2008-sm-0056
b-2008-sm-0057
b-2008-sm-0058
b-2008-sm-0059
b-2008-sm-0060
b-2008-sm-0061
b-2008-sm-0062
b-2008-sm-0063
b-2008-sm-0064
I got this list from https://secure.pub.build.mozilla.org/slavealloc/api/slaves filtering by
if s['distro'] == 'win2k8' and s['environment'] == 'prod' and s['trustid'] == 5:
Amy, does that look like the correct filtering to you? I see it is picking up seamicro machines, but I guess this does no harm. I didn't want to omit slavealloc-disabled slaves, since if they get reenabled, they would then be missing the key.
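For reference, the filter above could be wrapped up roughly as follows. This is only a sketch: the `name` and `enabled` fields and the sample records are assumptions made for illustration, and the real slavealloc response may be shaped differently. Only the three filter conditions come from this comment.

```python
def select_slaves(slaves):
    """Return sorted names of slaves matching the filter quoted above.

    Deliberately does not look at any enabled/disabled flag, so that
    slavealloc-disabled slaves are included too.
    """
    return sorted(
        s["name"]  # "name" is an assumed field name
        for s in slaves
        if s["distro"] == "win2k8"
        and s["environment"] == "prod"
        and s["trustid"] == 5
    )


# Made-up sample records, not real slavealloc data.
sample = [
    {"name": "b-2008-ix-0001", "distro": "win2k8",
     "environment": "prod", "trustid": 5, "enabled": True},
    {"name": "b-2008-sm-0033", "distro": "win2k8",
     "environment": "prod", "trustid": 5, "enabled": False},
    {"name": "b-2008-ix-0097", "distro": "win2k8",
     "environment": "preprod", "trustid": 5, "enabled": True},
    {"name": "example-linux-host", "distro": "linux64",
     "environment": "prod", "trustid": 5, "enabled": True},
]

selected = select_slaves(sample)
```

With this sample, the disabled seamicro machine is kept and the preprod machine is dropped, matching the intent described above.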
Flags: needinfo?(arich)
Assignee | ||
Comment 73•11 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #72)
> I'm currently uploading the new key using Nick's script
Correction, Catlee's script!!
pmoore@Elisandra:~/hg/braindump/utils $ hg log sftp.py
changeset: 214:e1b9c2c7d28d
user: Chris AtLee <catlee@mozilla.com>
date: Fri Jun 21 09:46:55 2013 -0400
summary: Adding sftp utility, which works on windows!!!111!11! \o/
Comment 74•11 years ago
|
||
As a systems person, I take an inventory approach to this rather than using data from slaveapi. My search would look like: https://inventory.mozilla.org/en-US/core/search/#q=winbuild.releng.scl3.mozilla.com&System
(everything in winbuild.releng.scl3.mozilla.com).
There may be some differences in there if there are machines loaned out, etc, since we nuke the keys from those explicitly so we don't have any breaches of security. I would suggest making sure that you do NOT copy the key out to any loaners.
Flags: needinfo?(arich)
Comment 75•11 years ago
|
||
I also have a larger concern about why the key is not there in the first place. If we create new machines that are not getting the appropriate key, we're going to keep running into this problem again and again. That's why I needinfoed Mark, to make sure that we're doing the right thing at installation time.
Assignee | ||
Comment 76•11 years ago
|
||
Thanks Amy.
I agree, and will hold off rolling out any changes until we are sure the imaging process is working correctly to avoid regressions.
The inventory list matches the slavealloc one, with the exceptions:
b-2008-ix-0081: loaned to dustin in https://bugzilla.mozilla.org/show_bug.cgi?id=1066164
b-2008-ix-0097: loaned and returned in https://bugzilla.mozilla.org/show_bug.cgi?id=819366 but marked as in preprod pool in slavealloc
So the list looks good.
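A comparison like this can be sketched with set differences. The function name and the sample host lists below are illustrative assumptions; the sketch also assumes inventory returns fully qualified names while slavealloc returns short names, which may not be how either system actually reports them.

```python
def compare_host_lists(inventory, slavealloc):
    """Return (only_in_inventory, only_in_slavealloc) as sorted short names."""
    # Normalise to lowercase short hostnames so FQDNs and short names match.
    inv = {h.split(".")[0].lower() for h in inventory}
    sa = {h.split(".")[0].lower() for h in slavealloc}
    return sorted(inv - sa), sorted(sa - inv)


# Made-up sample input for illustration only.
inventory_hosts = [
    "b-2008-ix-0080.winbuild.releng.scl3.mozilla.com",
    "b-2008-ix-0081.winbuild.releng.scl3.mozilla.com",
    "b-2008-ix-0097.winbuild.releng.scl3.mozilla.com",
]
slavealloc_hosts = ["b-2008-ix-0080"]

only_inventory, only_slavealloc = compare_host_lists(
    inventory_hosts, slavealloc_hosts
)
```

Anything that shows up in only one of the two lists, such as a loaner or a machine in the preprod pool, can then be checked by hand before any key is copied.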
Assignee | ||
Comment 77•11 years ago
|
||
Maybe worth us reimaging one machine (e.g. moving b-2008-ix-0097 back to prod pool and reimaging) to see if it picks up the new key?
If that works, we should be good going forward, and might save some teeth-pulling in diagnosing why the reimaging didn't pick up the changes?
Please note, several of these machines may not have rebooted - so if the reimaging was just updating the available GPO image for the machines, and not actually rebooting them, that could be the reason.
Flags: needinfo?(arich)
Comment 78•11 years ago
|
||
Reimaging means completely reinstalling the machine from scratch, not applying a GPO (which happens at every boot). I'll let Mark take a look and decide if he wants to reimage a machine to see if it picks up the correct key.
Flags: needinfo?(arich)
Comment 79•11 years ago
|
||
I am going to go ahead and disable b-2008-ix-0097 and take a look into what's happening here. The GPO for the key is a file copy, so a full re-image should not be needed.
Flags: needinfo?(mcornmesser)
Comment 80•11 years ago
|
||
I found what happened here. I dropped the file into a directory the GPO was using previously. When the try and build OUs were created, new directories were created for the .ssh files to reflect the OUs. I moved the file over to the correct directory, and it is now showing up on b-2008-ix-0097. Because this is a file copy and not an application install, it should go out to the build machines within an hour or so.
Updated•11 years ago
|
Severity: blocker → normal
Comment 81•11 years ago
|
||
[16:17:11] nthomas does https://bugzilla.mozilla.org/show_bug.cgi?id=1082535#c80 mean we're doing GPO tasks not at boot time ?!??
Flags: needinfo?(mcornmesser)
Comment 82•11 years ago
|
||
It does at boot as well.
A client checks with the server for updates periodically, as well as at boot. When it checks and picks up a task such as a file copy, it begins executing the task. Other tasks, such as a package install, a security change, or a registry change, will not take effect or start until the next boot.
Flags: needinfo?(mcornmesser)
Comment 83•11 years ago
|
||
(In reply to Mark Cornmesser [:markco] from comment #82)
> It does at boot as well.
>
> A client checks with the server for updates periodically, as well as at
> boot. When it checks and picks up a task such as a file copy, it begins
> executing the task. Other tasks, such as a package install, a security
> change, or a registry change, will not take effect or start until the
> next boot.
That's...unexpected.
Generally we don't want puppet or GPO services to run while machines are doing builds and/or tests. The possible slowdown could cause a machine to unexpectedly time out during routine activity, and invoke a flurry of activity from sheriffs, buildduty, and devs as sheriffs back out patches and close trees.
Is there a way to *only* run GPO changes on reboot? I think that's how everyone in releng thought the system was already working.
Comment 84•11 years ago
|
||
A little more information: when Group Policy updates, it is all done through background processes, and this should not affect foreground processes.
We could disable the refresh interval. Group Policy refreshes would then occur on log in/log out or, for our purposes, during shutdown and after boot.
Should we open a different bug to discuss this? Also let's pull in Arr and Q into the conversation.
Updated•11 years ago
|
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/574]
Assignee | ||
Comment 85•10 years ago
|
||
OK for me to close this coop now? The only outstanding matter is comment 83/84 - if we want a separate bug for that?
Flags: needinfo?(coop)
Comment 86•10 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #85)
> OK for me to close this coop now? The only outstanding matter is comment
> 83/84 - if we want a separate bug for that?
Yes, you can close it if we track the GPO issue elsewhere. Might be a bug on file already.
Flags: needinfo?(coop)
Comment 87•10 years ago
|
||
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/81854534
Assignee | ||
Comment 88•10 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #86)
> (In reply to Pete Moore [:pete][:pmoore] from comment #85)
> > OK for me to close this coop now? The only outstanding matter is comment
> > 83/84 - if we want a separate bug for that?
>
> Yes, you can close it if we track the GPO issue elsewhere. Might be a bug on
> file already.
Created bug 1093072 for this.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/574] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/574]
Updated•7 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard