Closed
Bug 794248
Opened 13 years ago
Closed 13 years ago
Re-imaging XP machines can come back in a non-usable state
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: arich)
References
Details
(Whiteboard: [reit-ops])
Hi Van,
Has anything changed in our XP imaging process?
It seems that all recently re-imaged XP machines cannot use taskkill properly, which causes various problems.
To check whether a machine is affected, type:
* tasklist /svc
and you will see a message like this:
"ERROR: Logon failure: unknown user name or bad password."
talos-r3-xp-ref does not show this symptom.
talos-r3-xp-001 got re-imaged today and it has this symptom.
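The check described above could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, it assumes a modern Python 3 rather than whatever is installed on the slaves, and the only thing taken from this bug is the `tasklist /svc` command and the error text it prints on an affected machine.

```python
import subprocess

# Error text quoted in this bug for affected machines.
LOGON_FAILURE = "ERROR: Logon failure: unknown user name or bad password."


def looks_broken(tasklist_output: str) -> bool:
    """Return True if the given `tasklist /svc` output contains the logon failure."""
    return LOGON_FAILURE.lower() in tasklist_output.lower()


def check_local_machine() -> bool:
    """Run `tasklist /svc` on this Windows machine; True means it looks affected."""
    result = subprocess.run(
        ["tasklist", "/svc"], capture_output=True, text=True, check=False
    )
    # The error may land on stdout or stderr, so inspect both.
    return looks_broken(result.stdout + result.stderr)
```

Separating the text check from the command invocation keeps the check usable on output pasted from a remote session.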
What snapshot are we using?
Have any of the tools for re-imaging XP slaves changed in the last 1-2 months? Hosts? Tool updates? Bootcamp?
There might be nothing going on, but an answer will help me a lot to cut down the list of possible problems.
Excuse me if I'm asking nonsense! :)
Comment 1•13 years ago
I PXE boot these XP hosts to get them to image so nothing has changed on my end.
Van
Van - this is in reference to Amy's comment in bug 788382 comment 20
I believe Armen is asking for:
a) identification of the current PXE image you're using
b) the list of available "roll back" images (in case he needs to do a binary search on when the problem occurred)
Based on the ref image working and the re-imaged machine not working, the validity of the PXE image is questionable.
Comment 3•13 years ago (Assignee)
Van can't answer any of these questions because all dcops does is kick off the automated imaging process that relops controls. DCops doesn't have access to the imaging server or the captured images.
Releng asked relops to take a new snapshot of the ref machine on 20120612, and this image has been used since then. We can instead image machines with the image snapshot that was requested on 20120508 or go back even further to 20111215 or 20111129. We can also try taking a new ref image, but it's unlikely that this will fix the problem unless there was a problem with the ref server in the first place that has since been corrected.
Neither relops nor dcops changes any of the tools on the ref servers, and no changes were made to the imaging server where the images were captured. The only change made on the deployment server end was to the password that the imaging account uses, but an error there would have prevented the machine from installing at all rather than causing issues on the client end after the installation.
Assignee: server-ops → arich
Component: Server Operations: DCOps → Server Operations: RelEng
QA Contact: dmoore → arich
Comment 4•13 years ago (Assignee)
FYI, based on the error, my first guess would be that the issue might lie with the root/administrator/cltbld password change that happened just prior to the last snapshot requested by releng.
Comment 5•13 years ago (Reporter)
Can we re-image with the last 3 snapshots and a fresh snapshot with current state?
You can use these slaves for testing purposes:
talos-r3-xp-09[1-5]
Meanwhile I will look into what I read on the newsgroups:
> Perhaps the file system permissions are incorrect and NT AUTHORITY\NETWORK SERVICE doesn't have permission to access C:\WINDOWS\system32\wbem\wmiprvse.exe ?
Comment 6•13 years ago (Reporter)
In other words,
I want to verify whether something went wrong in the creation of snapshot 20120612.
We may have been re-imaging slaves from it since then without noticing that something was wrong, since the only way to find out was for philor to report that such a slave was failing mochitest jobs.
These slaves can run in production and fail mochitest jobs without anyone noticing.
It was only when a bunch of them went online at once that they caught philor's eye.
Comment 7•13 years ago
Easy enough to check the theory that I've missed others in the same state: what WinXP slaves were reimaged between 2012-06-12 and 2012-08-28?
Comment 8•13 years ago (Reporter)
I'm currently only aware of talos-r3-xp-001 and talos-r3-xp-063.
Probably a bug query for "problem tracking" bugs between those dates would do the job.
Comment 9•13 years ago (Reporter)
5 slaves got re-imaged.
None of them have touched OPSI (see screenshot http://cl.ly/JkNE)
5 of them are named talos-r3-xp-ref (which means that OPSI has made no changes)
1 of them works.
I don't know what to suggest. I'm puzzled.
Updated•13 years ago (Reporter)
OS: Mac OS X → Windows XP
Summary: Please check if there is anything different on our re-imaging process for Windows XP slaves → Re-imaging XP machines can come back in a non-usable state
Comment 10•13 years ago (Reporter)
Latest status:
Slave            Latest image  Previous image
talos-r3-xp-091  not working   works
talos-r3-xp-092  not working   works
talos-r3-xp-093  not working   not working
talos-r3-xp-094  not working   not working
talos-r3-xp-095  works         not working
15:59 arr: so that points at it *NOT* being the image
15:59 armenzg: are the server with the images at a different colo? maybe network corruption? (I'm throwing a crazy idea)
15:59 arr: because we are 100% sure this image worked previously
15:59 arr: same colo
15:59 armenzg: correct
15:59 arr: same network
15:59 armenzg: imaging process
15:59 arr: same imaging server
Because things at Mozilla cannot be fun enough without ghosts.
Comment 11•13 years ago (Reporter)
I am not sure if this is a good idea, but we have not 100% ruled out OPSI, although I'm not sure why we would not have hit any issues before.
Notice that all 5 slaves were named "talos-r3-xp-ref".
This means that they do indeed try to talk with OPSI as the "talos-r3-xp-ref" machine.
This means that the "last seen" column of the entry "talos-r3-xp-ref.uib.local" gets updated.
If we really wanted to test a re-imaging process without OPSI at all, we would have to do the following:
* take talos-r3-xp-ref and disable OPSI in there (I don't know how to do that)
* take a snapshot
* re-image a slave and reboot it (make sure that "last seen" does not update in OPSI)
I'm thinking that perhaps we re-image several "talos-r3-xp-ref" slaves at the same time and OPSI gets too confused to talk with a few of them at once.
I don't know. It's a wild guess. It's the only race condition I could think of.
Perhaps, we want to try to re-image the 5 slaves in a sequence rather than at the same time.
On another note, the only two packages that are marked to be installed are:
* cleanup [1]
* passwordupdate [2]
[1] http://hg.mozilla.org/build/opsi-package-sources/file/e55c081cb8cf/cleanup/CLIENT_DATA/cleanup.ins
[2] http://hg.mozilla.org/build/opsi-package-sources/file/e55c081cb8cf/password-update/CLIENT_DATA/README
Comment 12•13 years ago (Reporter)
Should we for now focus on recovering as many of the ill slaves as possible and then debug with the last 5?
Comment 13•13 years ago (Reporter)
I tested talos-r3-w7-0[80-99] for arr to see if they had a similar issue since they got re-imaged recently.
Comment 14•13 years ago
(In reply to Armen Zambrano G. [:armenzg] from comment #9)
> 5 slaves got re-imaged.
> None of them have touched OPSI (see screenshot http://cl.ly/JkNE)
> 5 of them are named talos-r3-xp-ref (which means that OPSI have done no
> changes)
Correct me if I'm wrong, but we've *always* needed to manually set the computer name for Windows test hosts.
If the pckey matches that of the ref image, the new slave should still be able to sync up with OPSI, but wouldn't it think it had all the relevant packages installed already?
Recall that I didn't know about the production-slaves file when I set up the new XP slaves (https://bugzilla.mozilla.org/show_bug.cgi?id=788382#c7). They may have never properly synced with OPSI, unless they were force-updated by hand.
Comment 15•13 years ago (Assignee)
Armen: When we talked yesterday, you said that they weren't syncing with OPSI at all until their hostname was changed. If they *are* talking with OPSI, then there is still the possibility that there's something that OPSI is doing that's causing the corruption.
As to this being a race condition when many hosts image at once, we did talos-r3-xp-001 on its own, so that doesn't seem like it would be the cause unless the problem is too many hosts still checking in as the ref machine because their hostnames haven't been changed. Is that still the case?
If we suspect OPSI might be the problem, then I recommend shutting down OPSI while we reimage machines (as I recall from when we cut over from sjc1 to scl3, this does not cause a disruption in production?).
Comment 16•13 years ago
We could easily filter out the relevant hosts with iptables on the OPSI host.
Also, keep in mind that 001-005 (IIRC) can't talk to Microsoft due to the long-stalled network-isolation project, so if those hosts behave differently from the others, that may be a factor.
Comment 17•13 years ago (Reporter)
A slave named "talos-r3-xp-ref" will only get 2 packages deployed: cleanup and passwordupdate. Beyond that, I don't know what else OPSI does or does not do.
coop, I checked on talos-r3-xp-082 that the presence or absence of production-slaves does not cause problems. Note that some of your slaves came out right and some came out wrong.
arr, I never realized that the "talos-r3-xp-ref" entry in OPSI counts as a point where a slave can sync up with OPSI.
The same slave would not go through OPSI as "talos-r3-xp-ref" more than once unless a human reboots the slave again without changing the hostname.
In bug 794617, I need to take a snapshot for another reason. Would you like us to try to disable OPSI and create a snapshot without it?
On another note, I think we should fix the currently broken slaves and leave the last 5 for debugging. If no one objects, I will go ahead and investigate this.
Re-imaging process (as I currently understand it):
* RelOps mounts an image on a server
* DCOps re-images slave by pointing to that server
* slave boots up as "talos-r3-xp-ref" and syncs up once with OPSI
* releng changes the hostname and reboots the slave
* the slave reboots with correct hostname
** This time runslave.py/start-buildbot.bat succeeds and the slave can take jobs.
Summary:
* re-imaging + one sync with OPSI can yield different results (some slaves ill, some healthy)
* using the previous image or the newest image can produce sick slaves
* we know that some security settings get lost
* imaging server, network, opsi are all in the same colo
* race condition as "talos-r3-xp-ref" is unlikely
Actions:
* fix slaves (leave 5 behind)
* create an image without OPSI and re-image some slaves to see if any come out wrong
Machines involved in the process:
production-opsi.srv.releng.scl3.mozilla.com has address 10.26.48.38
talos-r3-xp-ref.build.scl1.mozilla.com has address 10.12.51.229
Comment 18•13 years ago (Assignee)
FYI, OPSI is *not* in the same colo, that is in scl3 while these slaves are in scl1.
Is there any way that we can remove talos-r3-xp-ref from the OPSI config altogether (not disable OPSI, but just ensure that it doesn't run successfully for a client until the hostname is changed) so that it never runs for that hostname? This assumes that the cleanup and passwordupdate scripts would run for the host once its name is changed.
Regarding your suggestion about the other nodes, I don't think we know how to fix the broken slaves, do we? It seems to be a crapshoot whether or not we get a working image (and who knows what else might be wrong).
And regarding the race condition, it is still possible if other hosts are named talos-r3-xp-ref and contacting the OPSI server at the same time. I don't think we can rule that out without shutting down OPSI and trying this (or firewalling off some number of hosts, but just temporarily shutting down OPSI is easier/more expedient).
Comment 19•13 years ago (Reporter)
(In reply to Amy Rich [:arich] [:arr] from comment #18)
> Is there any way that we can remove talos-r3-xp-ref from the OPSI config all
> together (not disable OPSI, but just ensure that it doesn't run successfully
> for a client until the hostname is changed) so that it never runs for that
> hostname? This assumes that the cleanup and passwordupdate scripts would
> run for the host once it's name is changed.
>
bhearsum, coop: what do you think of this ^ suggestion?
I have no objections.
> Regarding your suggestion about the other nodes, I don't think we know how
> to fix the broken slaves, do we? It seems to be a crap shoot whether or not
> we get a working image (and who knows what else might be wrong).
>
Right.
> And regarding the race condition, it is still possible if other hosts are
> named talos-r3-xp-ref and contacting the OPSI server at the same time. I
> don't think we can rule that out without shutting down OPSI and trying this
> (or firewalling off some number of hosts, but just temporarily shutting down
> OPSI is easier/more expedient).
OK.
Updated list of ill machines:
* talos-r3-xp-001
* talos-r3-xp-063
* talos-r3-xp-079
* talos-r3-xp-081
* talos-r3-xp-082
* talos-r3-xp-084
* talos-r3-xp-085
* talos-r3-xp-086
* talos-r3-xp-088
* talos-r3-xp-093
* talos-r3-xp-094
* talos-r3-xp-095 (actually dead - needs reboot; I think)
Comment 20•13 years ago
http://social.technet.microsoft.com/Forums/en-US/winserverManagement/thread/fb67e66c-ab6f-43be-b1e0-9c8396f7307a
on an affected machine:
dcomcnfg will crash when you twirldown Computers under Component Services
wmimgmt.msc fails (with the same permissions error) while showing properties
nothing's different in local computer settings
Updated•13 years ago (Reporter)
Blocks: talos-r3-xp-065
Updated•13 years ago (Reporter)
Blocks: talos-r3-xp-071
Comment 21•13 years ago (Reporter)
I have set aside these slaves as healthy slaves to mess with:
* talos-r3-xp-065
* talos-r3-xp-071
* talos-r3-xp-076
* talos-r3-xp-091
* talos-r3-xp-092
Comment 22•13 years ago (Reporter)
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.102
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.108
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.42
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.43
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.27
These slaves won't be able to talk to OPSI.
arr will be re-imaging them and see how they come out.
Comment 23•13 years ago (Reporter)
I got confused and I gave arr a list of *working* slaves rather than *broken* ones.
This time I got these slaves talos-r3-xp-0{85,86,88,93,94}
iptables -A INPUT -j REJECT --source 10.12.50.36
iptables -A INPUT -j REJECT --source 10.12.50.37
iptables -A INPUT -j REJECT --source 10.12.50.39
iptables -A INPUT -j REJECT --source 10.12.50.44
iptables -A INPUT -j REJECT --source 10.12.50.45
production-opsi:~# iptables --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
REJECT 0 -- talos-r3-xp-085.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-086.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-088.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-093.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-094.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
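For reference, the per-host rules used above, and the matching rules to lift the block afterwards, could be generated with something like the sketch below. The helper name `reject_rules` is hypothetical; the rule text simply mirrors the `iptables -A INPUT -j REJECT --source <ip>` commands shown in this bug, and `-D` with the same rule specification deletes a rule that `-A` added.

```python
import ipaddress


def reject_rules(addresses, action="-A"):
    """Build iptables commands that block (-A) or unblock (-D) the given hosts.

    The commands are only returned as strings; nothing is executed here.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' (append) or '-D' (delete)")
    rules = []
    for addr in addresses:
        # ip_address() raises ValueError on a malformed address.
        ip = ipaddress.ip_address(addr)
        rules.append(f"iptables {action} INPUT -j REJECT --source {ip}")
    return rules


if __name__ == "__main__":
    # Addresses taken from the commands in this comment.
    slaves = ["10.12.50.36", "10.12.50.37", "10.12.50.39",
              "10.12.50.44", "10.12.50.45"]
    for line in reject_rules(slaves) + reject_rules(slaves, action="-D"):
        print(line)
```

Generating both sets from one list makes it harder to leave a stale REJECT rule behind once testing is done.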
Comment 24•13 years ago (Reporter)
talos-r3-xp-0{85,91,92,93} got re-imaged and came out right.
I will try setting them up and put them in the pool.
FTR I have removed all IP restrictions on production-opsi.
Updated list of ill machines:
* talos-r3-xp-001
* talos-r3-xp-063
* talos-r3-xp-079
* talos-r3-xp-081
* talos-r3-xp-082
* talos-r3-xp-084
* talos-r3-xp-086
* talos-r3-xp-088
* talos-r3-xp-094
* talos-r3-xp-095 (actually dead - needs reboot; I think)
Comment 25•13 years ago (Assignee)
Based on the need for testers right now, I'm doing what I can to get these to a working state by reinstalling them again. If I keep getting a few working ones each time, that's something. :/
Also, 094 was already in a working state, so I'm not reinstalling that one.
Comment 26•13 years ago (Reporter)
Steps to set up any of the slaves that arr manages to recover:
1) run tasklist and check that it works (otherwise the slave needs to be re-imaged)
2) change the hostname
3) open OPSI.jar and mark _dumbwin32proc as "setup" (will be fixed after bug 795032)
4) enable the slave in slavealloc
5) reboot the machine
There is no need to remove the slaves from comment 24 from OPSI.jar since I have already removed their entries. Missing entries get regenerated every 5 minutes and their state should be copied over from the state of "talos-r3-xp-ref".
If in doubt, leave it for me in the morning.
Comment 27•13 years ago (Assignee)
These now have a working image:
* talos-r3-xp-063
* talos-r3-xp-079
* talos-r3-xp-082
* talos-r3-xp-094
Comment 28•13 years ago (Assignee)
Well, that's new and different.
I tried imaging the remaining hosts with the *new* xp image I took today, and out of those, the following came up and looked fine:
* talos-r3-xp-081
* talos-r3-xp-088
And these not only did not look fine, but after they imaged and rebooted, they were *completely* unreachable (no ping). Maru connected a monitor and said there was no video signal. He tried power cycling 2 out of 3 of them, and said it was running chkdsk and changing scads of security permissions back to default.
* talos-r3-xp-084
* talos-r3-xp-086
* talos-r3-xp-095
He tried netbooting them again with the same result (so those three are still unresponsive).
Comment 29•13 years ago
091, 092, 094, and 079 have each done a single Talos run when they hit Talos' "timeout exceeded." One run. I have no explanation for that.
Comment 30•13 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #28)
> He tried power cycling 2 out of 3 of them,
> and said it was running chkdsk and changing scads of security permissions
> back to default.
The chkdsk after a NTFS reimage is normal. The DS finalize scripts actually set the chkdsk on boot flag purposely.
Comment 31•13 years ago
But is it normal to restore the permissions to default? And what does "default" mean in this case, anyway? Chkdsk doesn't have a map of what permissions all files should have.
Comment 32•13 years ago (Reporter)
Thanks arr for managing to put so many slaves back into the pool.
I have put talos-r3-xp-0{81,88} into the pool.
##################################
(In reply to Phil Ringnalda (:philor) from comment #29)
> 091, 092, 094, and 079 have each done a single Talos run when they hit
> Talos' "timeout exceeded." One run. I have no explanation for that.
All of them were due to datazilla problems on mozilla-aurora (it got fixed last night): "Rev3 WINNT 5.1 mozilla-aurora pgo talos"
See the log: http://pastebin.mozilla.org/1847780
##################################
List of remaining ill slaves:
* talos-r3-xp-001 (staging)
* talos-r3-xp-084
* talos-r3-xp-086
* talos-r3-xp-095
Comment 33•13 years ago (Reporter)
Should I run an antivirus on talos-r3-xp-ref? (worth asking)
Comment 34•13 years ago
Spinning this one out of the dependency tree for bug 788382 (as that is closed) and adding it to See Also.
Comment 35•13 years ago (Assignee)
I believe I fixed the issue where it wasn't expanding the NTFS volume after restoration.
001, 084, and 086 seem to have come back up okay.
081, 086, and 095 are taking forever to respond to ping, so I suspect they're going through and chkdsk is munging the FS again.
I'm starting to wonder if we've had this issue all along (or at least for a very long time) and just never noticed it because we don't reimage XP machines very often.
Comment 36•13 years ago (Assignee)
According to catlee on irc, taskkill is a recent addition to the XP machines. It's quite possible that the imaging process is behaving the same as it always did, and that taskkill is just exposing the fact that it's always cleaned up permissions on the final chkdsk in a number of cases.
Comment 37•13 years ago (Reporter)
I don't think that taskkill only started being used recently and thereby exposed the underlying issue.
automation.py.in has had taskkill in its code since at least 2009 [1]. I stopped looking further back in time.
I'm almost 100% sure that it was automation.py choking rather than _dumbwin32proc.py's taskkill because 1) I could hit it manually without buildbot and 2) I had placed debugging code that led me to a specific line of automation.py
Also note that _dumbwin32proc.py got replaced on the XP slaves at the beginning of 2011 [2].
Also notice that the last 20 XP slaves got re-imaged out of a snapshot that did not have the correct _dumbwin32proc.py and OPSI was *not* deploying it.
[1] http://hg.mozilla.org/mozilla-central/annotate/a2a7177a218e/build/automation.py.in#l128
[2] http://hg.mozilla.org/build/opsi-package-sources/log/e55c081cb8cf/twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py
Comment 38•13 years ago (Assignee)
I've now gotten a successful image on:
* 081
* 086
* 095
All XP machines should, afaict, have working images now.
When talking to Jake, he was under the impression that these machines always took over an hour to chkdsk after rebooting, which is further anecdotal evidence that we may have had this issue for a long time.
In light of the fact that there is no clear cause and taking new images does not fix it, the best solution is to document that this is a problem, how to identify it, and instructions that the machine should just be netbooted again till it works.
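The "netboot again till it works" workaround amounts to a bounded retry loop. A minimal sketch follows, assuming two caller-supplied callables that do not exist anywhere in our tooling: `reimage(host)` to kick off a netboot re-image, and `is_healthy(host)` to run the tasklist check once the host is back up. The attempt limit is likewise an arbitrary illustrative value.

```python
from typing import Callable


def reimage_until_healthy(
    host: str,
    reimage: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_attempts: int = 5,
) -> int:
    """Re-image `host` until it passes the health check.

    Returns the number of attempts used, or raises RuntimeError once
    max_attempts re-images have all produced an unhealthy machine.
    """
    for attempt in range(1, max_attempts + 1):
        reimage(host)          # hypothetical: trigger the netboot re-image
        if is_healthy(host):   # hypothetical: e.g. `tasklist /svc` works
            return attempt
    raise RuntimeError(
        f"{host} still unhealthy after {max_attempts} re-image attempts"
    )
```

Capping the attempts matters because a host that never comes back (no ping, no video) needs hands-on attention rather than another netboot.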
Comment 39•13 years ago (Assignee)
I've documented this behavior here:
https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=28575847
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 40•13 years ago (Reporter)
I have put all remaining slaves back into the pool.
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations