Closed Bug 794248 Opened 13 years ago Closed 13 years ago

Re-imaging XP machines can come back in a non-usable state

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Windows XP
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: arich)

References

Details

(Whiteboard: [reit-ops])

Hi Van, Is there anything different in our XP imaging process? It seems that all recent XP machines cannot use taskkill properly, and this causes various problems. To check whether a machine is affected, type:
* tasklist /svc
and you will see a message like this: "ERROR: Logon failure: unknown user name or bad password."
talos-r3-xp-ref does not show this symptom. talos-r3-xp-001 got re-imaged today and it has this symptom.
What snapshot are we using? Have any of the tools for re-imaging XP slaves changed in the last 1-2 months? hosts? tools update? bootcamp? There might be nothing going on, but it would help me a lot to cut down the list of possible problems. Excuse me if I'm asking nonsense! :)
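A minimal sketch of the check described above, in case it is useful to script it rather than type it by hand. It assumes Python 3 is available on the machine being checked; the function names are hypothetical, and the only string taken from this bug is the error message itself.

```python
import subprocess

# Error text as quoted in the report; matching is case-insensitive.
LOGON_FAILURE = "ERROR: Logon failure: unknown user name or bad password."


def output_shows_bad_image(output: str) -> bool:
    """Return True if the given tasklist output contains the logon-failure error."""
    return LOGON_FAILURE.lower() in output.lower()


def check_local_machine() -> bool:
    """Run `tasklist /svc` on this Windows machine and report whether it looks affected.

    Both stdout and stderr are inspected because it is an assumption
    which stream the error is written to.
    """
    result = subprocess.run(["tasklist", "/svc"], capture_output=True, text=True)
    return output_shows_bad_image(result.stdout + result.stderr)
```

Calling check_local_machine() on a slave would return True when the symptom is present. Only output_shows_bad_image() can be exercised off-Windows, since tasklist exists only there.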
I PXE boot these XP hosts to get them to image so nothing has changed on my end. Van
Van - this is in reference to Amy's comment in bug 788382 comment 20. I believe Armen is asking for:
a) identification of the current PXE image you're using
b) the list of available "roll back" images (in case he needs to do a binary search on when the problem occurred)
Based on the ref image working and the re-imaged machine not working, the validity of the PXE image is questionable.
Van can't answer any of these questions because all dcops does is kick off the automated imaging process that relops controls. DCops doesn't have access to the imaging server or the captured images. Releng asked relops to take a new snapshot of the ref machine on 20120612, and this image has been used since then. We can instead image machines with the image snapshot that was requested on 20120508 or go back even further to 20111215 or 20111129. We can also try taking a new ref image, but it's unlikely that this will fix the problem unless there was a problem with the ref server in the first place that has since been corrected. Neither relops nor dcops changes any of the tools on the ref servers, and no changes were made to the imaging server where the images were captured. The only change that's been made on the deployment server end was to the password that the imaging account uses, but an error there would have resulted in the machine not installing at all, not in issues on the client end after the installation.
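For reference, the binary search over snapshots mentioned earlier could be sketched as below. This is only an illustration: the snapshot dates are the ones listed in this comment, while is_good is a hypothetical callback standing in for "re-image a slave from that snapshot and check tasklist". It also assumes each snapshot is consistently good or bad, with every snapshot after the first bad one also bad, which may not hold here.

```python
from typing import Callable, Optional, Sequence

# Snapshot dates named in this bug, oldest first.
SNAPSHOTS = ["20111129", "20111215", "20120508", "20120612"]


def first_bad_snapshot(
    snapshots: Sequence[str], is_good: Callable[[str], bool]
) -> Optional[str]:
    """Return the oldest snapshot that produces a broken machine, or None.

    Assumes snapshots are ordered oldest to newest and that once a
    snapshot is bad, every later one is bad as well.
    """
    lo, hi = 0, len(snapshots)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(snapshots[mid]):
            lo = mid + 1  # problem was introduced after this snapshot
        else:
            hi = mid  # this snapshot or an earlier one is the first bad one
    return snapshots[lo] if lo < len(snapshots) else None
```

With four snapshots this needs at most three test re-images instead of four, so the saving is small; it matters more if older images turn up.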
Assignee: server-ops → arich
Component: Server Operations: DCOps → Server Operations: RelEng
QA Contact: dmoore → arich
FYI, based on the error, my first guess would be that the issue might lie with the root/administrator/cltbld password change that happened just prior to the last snapshot requested by releng.
Can we re-image with the last 3 snapshots and a fresh snapshot with current state? You can use these slaves for testing purposes: talos-r3-xp-09[1-5] Meanwhile I will look into what I read on the newsgroups: > Perhaps the file system permissions are incorrect and NT AUTHORITY\NETWORK SERVICE > doesn't have permission to access C:\WINDOWS\system32\wbem\wmiprvse.exe ?
In other words, I want to verify whether something went wrong in the creation of snapshot 20120612. We may have been re-imaging slaves since then without noticing that something was wrong, since the only way to find out was for philor to report that such a slave was failing mochitest jobs. These slaves can actually run in production and fail mochitest jobs without anyone noticing. It was only when a bunch of them went online that philor spotted them.
Depends on: 794487
Easy enough to check the theory that I've missed others in the same state: what WinXP slaves were reimaged between 2012-06-12 and 2012-08-28?
I'm currently only aware of talos-r3-xp-001 and talos-r3-xp-063. Probably a bug query for "problem tracking" bugs between those dates would do the job.
5 slaves got re-imaged.
None of them have touched OPSI (see screenshot http://cl.ly/JkNE)
5 of them are named talos-r3-xp-ref (which means that OPSI has made no changes)
1 of them works.
I don't know what to suggest. I'm puzzled.
OS: Mac OS X → Windows XP
Summary: Please check if there is anything different on our re-imaging process for Windows XP slaves → Re-imaging XP machines can come back in a non-usable state
Latest status:
                  Lat.image   Prev.Image
talos-r3-xp-091   not work    works
talos-r3-xp-092   not work    works
talos-r3-xp-093   not work    not work
talos-r3-xp-094   not work    not work
talos-r3-xp-095   works       not work
15:59 arr: so that points at it *NOT* being the image
15:59 armenzg: are the server with the images at a different colo? maybe network corruption? (I'm throwing a crazy idea)
15:59 arr: because we are 100% sure this image worked previously
15:59 arr: same colo
15:59 armenzg: correct
15:59 arr: same network
15:59 armenzg: imaging process
15:59 arr: same imaging server
Because things at Mozilla cannot be fun enough without ghosts.
I am not sure if this is a good idea, but we have not 100% ruled out OPSI, and I'm not sure why we would not have hit any issues before. If you notice, all 5 slaves were named "talos-r3-xp-ref". This means that they do indeed try to talk to OPSI as the "talos-r3-xp-ref" machine, and that the "last seen" column of the entry "talos-r3-xp-ref.uib.local" gets updated.
If we really wanted to test a re-imaging process without OPSI at all, we would have to do the following:
* take talos-r3-xp-ref and disable OPSI on it (I don't know how to do that)
* take a snapshot
* re-image a slave and reboot it (make sure that "last seen" does not update in OPSI)
I'm thinking that perhaps we re-image several "talos-r3-xp-ref" slaves at the same time and OPSI gets too confused talking to a few of them at once. I don't know. It's a wild guess. It's the only race condition I could think of. Perhaps we want to try re-imaging the 5 slaves in sequence rather than at the same time.
On another note, the only two packages that are marked to be installed are:
* cleanup [1]
* passwordupdate [2]
[1] http://hg.mozilla.org/build/opsi-package-sources/file/e55c081cb8cf/cleanup/CLIENT_DATA/cleanup.ins
[2] http://hg.mozilla.org/build/opsi-package-sources/file/e55c081cb8cf/password-update/CLIENT_DATA/README
Should we for now try to focus on recovering a lot of the ill slaves and then debug with the last 5?
I tested talos-r3-w7-0[80-99] for arr to see if they had similar issues since they got re-imaged recently.
(In reply to Armen Zambrano G. [:armenzg] from comment #9)
> 5 slaves got re-imaged.
> None of them have touched OPSI (see screenshot http://cl.ly/JkNE)
> 5 of them are named talos-r3-xp-ref (which means that OPSI have done no
> changes)
Correct me if I'm wrong, but we've *always* needed to manually set the computer name for Windows test hosts. If the pckey matches that of the ref image, the new slave should still be able to sync up with OPSI, but wouldn't it think it had all the relevant packages installed already? Recall that I didn't know about the production-slaves file when I set up the new XP slaves (https://bugzilla.mozilla.org/show_bug.cgi?id=788382#c7). They may have never properly synced with OPSI, unless they were force-updated by hand.
Armen: When we talked yesterday, you said that they weren't syncing with OPSI at all until their hostname was changed. If they *are* talking with OPSI, then there is still the possibility that there's something that OPSI is doing that's causing the corruption. As to this being a race condition when many hosts image at once, we did talos-r3-xp-001 on its own, so that doesn't seem like it would be the cause unless the problem is too many hosts still checking in as the ref machine because their hostnames haven't been changed. Is that still the case? If we suspect OPSI might be the problem, then I recommend shutting down OPSI while we reimage machines (as I recall from when we cut over from sjc1 to scl3, this does not cause a disruption in production?).
We could easily filter out the relevant hosts with iptables on the OPSI host. Also, keep in mind that 001-005 (IIRC) can't talk to Microsoft due to the long-stalled network-isolation project, so if there's different behavior on those hosts from the others, that may be a factor.
A slave named "talos-r3-xp-ref" will only get 2 packages deployed: cleanup and passwordupdate. Beyond that, I don't know what else OPSI does or does not do.
coop, I checked on talos-r3-xp-082 that having or not having production-slaves does not cause problems. If you notice, some of your slaves came out right and some came out wrong.
arr, I never realized that the "talos-r3-xp-ref" entry in OPSI counts as a point where a slave can sync up with OPSI. The same slave would not go through OPSI as "talos-r3-xp-ref" more than once unless a human reboots the slave again without changing the hostname.
In bug 794617, I need to take a snapshot for another reason. Would you like us to try to disable OPSI and create a snapshot without it?
On another note, I think we should fix the currently broken slaves and leave the last 5 for debugging. If no one objects, I will go ahead and investigate this.
Re-imaging process (as I currently understand it):
* RelOps mounts an image on a server
* DCOps re-images the slave by pointing it at that server
* the slave boots up as "talos-r3-xp-ref" and syncs up once with OPSI
* releng changes the hostname and reboots the slave
* the slave reboots with the correct hostname
** This time runslave.py/start-buildbot.bat succeeds and the slave can take jobs.
Summary:
* re-imaging + one sync with OPSI can yield different results (some slaves healthy, some slaves ill)
* using the previous image or the newest image can produce sick slaves
* we know that some security settings get lost
* imaging server, network, opsi are all in the same colo
* race condition as "talos-r3-xp-ref" is unlikely
Actions:
* fix slaves (leave 5 behind)
* create an image without OPSI and re-image some slaves to see if any come out wrong
Machines involved in the process:
production-opsi.srv.releng.scl3.mozilla.com has address 10.26.48.38
talos-r3-xp-ref.build.scl1.mozilla.com has address 10.12.51.229
FYI, OPSI is *not* in the same colo; it is in scl3 while these slaves are in scl1. Is there any way that we can remove talos-r3-xp-ref from the OPSI config altogether (not disable OPSI, but just ensure that it doesn't run successfully for a client until the hostname is changed) so that it never runs for that hostname? This assumes that the cleanup and passwordupdate scripts would run for the host once its name is changed. Regarding your suggestion about the other nodes, I don't think we know how to fix the broken slaves, do we? It seems to be a crap shoot whether or not we get a working image (and who knows what else might be wrong). And regarding the race condition, it is still possible if other hosts are named talos-r3-xp-ref and contacting the OPSI server at the same time. I don't think we can rule that out without shutting down OPSI and trying this (or firewalling off some number of hosts, but just temporarily shutting down OPSI is easier/more expedient).
(In reply to Amy Rich [:arich] [:arr] from comment #18)
> Is there any way that we can remove talos-r3-xp-ref from the OPSI config
> altogether (not disable OPSI, but just ensure that it doesn't run successfully
> for a client until the hostname is changed) so that it never runs for that
> hostname? This assumes that the cleanup and passwordupdate scripts would
> run for the host once its name is changed.
>
bhearsum, coop: what do you think of this ^ suggestion? I have no objections.
> Regarding your suggestion about the other nodes, I don't think we know how
> to fix the broken slaves, do we? It seems to be a crap shoot whether or not
> we get a working image (and who knows what else might be wrong).
>
Right.
> And regarding the race condition, it is still possible if other hosts are
> named talos-r3-xp-ref and contacting the OPSI server at the same time. I
> don't think we can rule that out without shutting down OPSI and trying this
> (or firewalling off some number of hosts, but just temporarily shutting down
> OPSI is easier/more expedient).
OK.
Updated list of ill machines:
* talos-r3-xp-001
* talos-r3-xp-063
* talos-r3-xp-079
* talos-r3-xp-081
* talos-r3-xp-082
* talos-r3-xp-084
* talos-r3-xp-085
* talos-r3-xp-086
* talos-r3-xp-088
* talos-r3-xp-093
* talos-r3-xp-094
* talos-r3-xp-095 (actually dead - needs reboot; I think)
http://social.technet.microsoft.com/Forums/en-US/winserverManagement/thread/fb67e66c-ab6f-43be-b1e0-9c8396f7307a
on an affected machine:
* dcomcnfg will crash when you twirldown Computers under Component Services
* wmimgmt.msc fails (with the same permissions error) while showing properties
* nothing's different in local computer settings
I have set aside these slaves as healthy slaves to mess with:
* talos-r3-xp-065
* talos-r3-xp-071
* talos-r3-xp-076
* talos-r3-xp-091
* talos-r3-xp-092
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.102
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.108
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.42
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.43
production-opsi:~# iptables -A INPUT -j REJECT --source 10.12.50.27
These slaves won't be able to talk to OPSI. arr will be re-imaging them and see how they come out.
I got confused and I gave arr a list of *working* slaves rather than *broken* ones. This time I got these slaves: talos-r3-xp-0{85,86,88,93,94}
iptables -A INPUT -j REJECT --source 10.12.50.36
iptables -A INPUT -j REJECT --source 10.12.50.37
iptables -A INPUT -j REJECT --source 10.12.50.39
iptables -A INPUT -j REJECT --source 10.12.50.44
iptables -A INPUT -j REJECT --source 10.12.50.45
production-opsi:~# iptables --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
REJECT 0 -- talos-r3-xp-085.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-086.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-088.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-093.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
REJECT 0 -- talos-r3-xp-094.build.scl1.mozilla.com anywhere reject-with icmp-port-unreachable
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
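Since the block list had to be retyped once already, here is a small sketch of how the rules could be generated from a list of addresses instead. It only builds the command strings and does not run them; the helper name is hypothetical, the addresses are the ones from this comment, and the -D form for removing a rule is the standard iptables counterpart of -A.

```python
import ipaddress
from typing import Iterable, List

# Addresses blocked in this comment (talos-r3-xp-0{85,86,88,93,94}).
BLOCKED_IPS = ["10.12.50.36", "10.12.50.37", "10.12.50.39", "10.12.50.44", "10.12.50.45"]


def reject_rules(ips: Iterable[str], delete: bool = False) -> List[str]:
    """Build iptables commands that reject (or stop rejecting) each source address.

    Raises ValueError if an entry is not a valid IP address, so a typo
    is caught before anything is pasted into a root shell.
    """
    flag = "-D" if delete else "-A"
    commands = []
    for ip in ips:
        ipaddress.ip_address(ip)  # validation only; raises ValueError on bad input
        commands.append(f"iptables {flag} INPUT -j REJECT --source {ip}")
    return commands


if __name__ == "__main__":
    for command in reject_rules(BLOCKED_IPS):
        print(command)
```

Calling reject_rules(BLOCKED_IPS, delete=True) gives the matching commands to lift the restrictions once the test re-images are done.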
Blocks: 794987
talos-r3-xp-0{85,91,92,93} got re-imaged and came out right. I will try setting them up and put them in the pool. FTR I have removed all IP restrictions on production-opsi.
Updated list of ill machines:
* talos-r3-xp-001
* talos-r3-xp-063
* talos-r3-xp-079
* talos-r3-xp-081
* talos-r3-xp-082
* talos-r3-xp-084
* talos-r3-xp-086
* talos-r3-xp-088
* talos-r3-xp-094
* talos-r3-xp-095 (actually dead - needs reboot; I think)
Based on the need for testers right now, I'm doing what I can to get these to a working state by reinstalling them again. If I keep getting a few working ones each time, that's something. :/ Also, 094 was already in a working state, so I'm not reinstalling that one.
Steps to set up any of the slaves that arr might be able to recover:
1) run tasklist and see that it works (otherwise the slave needs to be re-imaged)
2) change the hostname
3) open OPSI.jar and mark _dumbwin32proc as "setup" (will be fixed after bug 795032)
4) enable the slave in slavealloc
5) reboot the machine
There is no need to remove the slaves from comment 24 from OPSI.jar since I have already removed the entries in advance. Missing entries get regenerated every 5 minutes and the state should be copied over from the state of "talos-r3-xp-ref". If in doubt, leave it for me in the morning.
These now have a working image:
* talos-r3-xp-063
* talos-r3-xp-079
* talos-r3-xp-082
* talos-r3-xp-094
Well, that's new and different. I tried imaging the remaining hosts with the *new* xp image I took today, and out of those, the following came up and looked fine:
* talos-r3-xp-081
* talos-r3-xp-088
And these not only did not look fine, but after they imaged and rebooted, they were *completely* unreachable (no ping). Maru connected a monitor and said there was no video signal. He tried power cycling 2 out of 3 of them, and said it was running chkdsk and changing scads of security permissions back to default.
* talos-r3-xp-084
* talos-r3-xp-086
* talos-r3-xp-095
He tried netbooting them again with the same result (so those three are still unresponsive).
091, 092, 094, and 079 have each done a single Talos run when they hit Talos' "timeout exceeded." One run. I have no explanation for that.
(In reply to Amy Rich [:arich] [:arr] from comment #28)
> He tried power cycling 2 out of 3 of them,
> and said it was running chkdsk and changing scads of security permissions
> back to default.
The chkdsk after an NTFS reimage is normal. The DS finalize scripts actually set the chkdsk on boot flag purposely.
But is it normal to restore the permissions to default? And what does "default" mean in this case, anyway? chkdsk doesn't have a map of what permissions all files should have.
Thanks arr for managing to put so many slaves back into the pool. I have put talos-r3-xp-0{81,88} into the pool.
##################################
(In reply to Phil Ringnalda (:philor) from comment #29)
> 091, 092, 094, and 079 have each done a single Talos run when they hit
> Talos' "timeout exceeded." One run. I have no explanation for that.
All of them were due to datazilla problems on mozilla-aurora (it got fixed last night): "Rev3 WINNT 5.1 mozilla-aurora pgo talos"
See the log: http://pastebin.mozilla.org/1847780
##################################
List of remaining ill slaves:
* talos-r3-xp-001 (staging)
* talos-r3-xp-084
* talos-r3-xp-086
* talos-r3-xp-095
Should I run an antivirus on talos-r3-xp-ref? (worth asking)
spinning this one out of the dependency tree for bug 788382 (as that closed), and added it to see also
No longer blocks: 788382
See Also: → 788382
Whiteboard: [reit-ops]
I believe I fixed the issue where it wasn't expanding the NTFS volume after restoration. 001, 084, and 086 seem to have come back up okay. 081, 086, and 095 are taking forever to respond to ping, so I suspect they're going through and chkdsk is munging the FS again. I'm starting to wonder if we've had this issue all along (or at least for a very long time) and just never noticed it because we don't reimage XP machines very often.
According to catlee on irc, taskkill is a recent addition to the XP machines. It's quite possible that the imaging process is behaving the same as it always did, and that taskkill is just exposing the fact that it's always cleaned up permissions on the final chkdsk in a number of cases.
I don't think that taskkill only started being used recently and that this is what triggered the underlying issue. automation.py.in has had taskkill in its code since at least 2009 [1]. I stopped looking further back in time.
I'm almost 100% sure that it was automation.py choking rather than _dumbwin32proc.py's taskkill because 1) I could hit it manually without buildbot and 2) I had placed debugging code that led me to a specific line of automation.py.
Also note that _dumbwin32proc.py got replaced on the XP slaves at the beginning of 2011 [2]. Also notice that the last 20 XP slaves got re-imaged from a snapshot that did not have the correct _dumbwin32proc.py, and OPSI was *not* deploying it.
[1] http://hg.mozilla.org/mozilla-central/annotate/a2a7177a218e/build/automation.py.in#l128
[2] http://hg.mozilla.org/build/opsi-package-sources/log/e55c081cb8cf/twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py
I've now gotten a successful image on:
* 081
* 086
* 095
All XP machines should, afaict, have working images now. When talking to Jake, he was under the impression that these machines always took over an hour to chkdsk after rebooting, which is further anecdotal evidence that we may have had this issue for a long time. In light of the fact that there is no clear cause and taking new images does not fix it, the best solution is to document that this is a problem, how to identify it, and instructions that the machine should just be netbooted again till it works.
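The "netboot again till it works" procedure could be written down roughly as follows. This is a sketch only: netboot and image_is_usable are hypothetical callables standing in for whatever kicks off the PXE re-image and for the tasklist check, and the cap on attempts is an assumption added so the loop cannot run forever.

```python
from typing import Callable, Optional


def reimage_until_usable(
    netboot: Callable[[], None],
    image_is_usable: Callable[[], bool],
    max_attempts: int = 5,
) -> Optional[int]:
    """Re-image a machine until it comes back usable.

    Returns the number of attempts it took, or None if the machine was
    still broken after max_attempts re-images.
    """
    for attempt in range(1, max_attempts + 1):
        netboot()  # hypothetical: trigger the re-image and wait for it to finish
        if image_is_usable():  # hypothetical: e.g. run tasklist and look for the error
            return attempt
    return None
```

A None result is the point at which a human should look at the machine rather than re-imaging it yet again.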
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I have put all remaining slaves back into the pool.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations