Closed Bug 620517 Opened 14 years ago Closed 14 years ago

20 talos-r3-xp slaves have older passwords - network setup issue

Categories

(Release Engineering :: General, defect, P2)

x86
Windows XP

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: coop)

References

Details

(Whiteboard: [opsi][buildduty])

NOTE: A lot of the information in this bug is now documented in this section, which still needs review: https://wiki.mozilla.org/ReleaseEngineering/OPSI#Troubleshooting

After a slave is re-imaged, we have to regenerate the OPSI hostkey on that slave (https://wiki.mozilla.org/ReleaseEngineering/OPSI#Wrong_hostkey). I matched the hostkey on the slave (talos-r3-xp-001) to what is on the master, which seems to do the trick.

What happens afterwards is that, even though the OPSI master and the OPSI client can talk to each other, the client tries to reach the wrong drive (P drive instead of Z drive). The P drive information is loaded from the registry.

Another way to notice this problem is that the pre-login screen takes forever (around 7 mins) before it logs into Windows (IMO it should not log in at all, or it should prevent buildbot from starting, but this is a separate issue). You can also notice that it says "Mit Netzlaufwerken verbinden, bitte noch etwas warten", which means "Connecting to network drives, please wait a bit", and shows a status code at the bottom right: "Status: 53". This code can be seen in the log below.

> 12/20/2010 12:36:42 PM winstMasterDirectory P:\install\opsi-winst\files\opsi-winst
> 12/20/2010 12:36:42 PM Profildateienpfad P:\install\
> 12/20/2010 12:36:42 PM Bootmode BKSTD
> 12/20/2010 12:36:42 PM trying to connect remote resource "\\staging-opsi\opt_pcbin" to local resource "P:"
> 12/20/2010 12:37:03 PM Fehler 53 ("The network path was not found")
> 12/20/2010 12:43:20 PM try_network_connect set to false. Reached time out 180 secs

Switching the drive to "Z:" in the registry and rebooting the machine does *not* seem to fix the issue. After rebooting, the registry says "P:" again. Could these values be loaded from the samba staging-opsi mount? I can see the "opt_pcbin on 'Samba 3.0.24 (10.2.71.216)' (Z:)" mount on the machine. Tomorrow I will try to mount on P instead of Z and/or re-install once more (I did re-install from staging-opsi - 10.2.71.216).
The drive can be checked on the registry HKLM\SOFTWARE\opsi.org\shareinfo:
* configdrive
* depotdrive
* utilsdrive

I spotted the problem by reading C:\tmp\logonlog.txt

12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, pckey: DONT SHOW IT
12/20/2010 12:36:41 PM reading pckey from file "C:\Program Files\opsi.org\preloginloader\cfg\locked.cfg"
12/20/2010 12:36:41 PM pckey read from file
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\general, tftpserver: staging-opsi.build.mozilla.org
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\general, configlocal: 0
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, user: pcpatch
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, smbusername1: opsiserver\pcpatch
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, try_secondary_user: 0
...
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, depoturl: smb:\\staging-opsi\opt_pcbin\install
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, configurl: smb:\\staging-opsi\opt_pcbin\install
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, utilsurl: smb:\\staging-opsi\opt_pcbin\utils
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, utilsdrive: P:
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, configdrive: P:
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, depotdrive: P:
12/20/2010 12:36:41 PM Error: Requested bitmap could not be loaded
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\pcptch, button_stopnetworking: immediate
12/20/2010 12:36:41 PM Button StopNetworking enabled
12/20/2010 12:36:41 PM servicehost reached in 0 ms
12/20/2010 12:36:41 PM determining opsi client ID
12/20/2010 12:36:41 PM opsi service with URL https://10.2.71.216:4447
12/20/2010 12:36:41 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getHostId%22,%22params%22:%5B%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":""}"
12/20/2010 12:36:42 PM No client ID got from service
12/20/2010 12:36:42 PM Try with ipname from local system: talos-r3-xp-001
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getDomain%22,%22params%22:%5B%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":"build.mozilla.org"}"
12/20/2010 12:36:42 PM Default domain from service >build.mozilla.org<
12/20/2010 12:36:42 PM we supplement default domain from service to name: talos-r3-xp-001
12/20/2010 12:36:42 PM We have client ID talos-r3-xp-001.build.mozilla.org
12/20/2010 12:36:42 PM opsi service with URL https://10.2.71.216:4447
12/20/2010 12:36:42 PM opsi service "https://10.2.71.216:4447", client "talos-r3-xp-001.build.mozilla.org" , username "talos-r3-xp-001.build.mozilla.org"
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22authenticated%22,%22params%22:%5B%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":true}"
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getNetworkConfig%5Fhash%22,%22params%22:%5B%22talos-r3-xp-001.build.mozilla.org%22%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":{"depotDrive":"P:","nextBootServiceURL":"https://10.2.71.216:4447","utilsUrl":"smb://staging-opsi/opt_pcbin/utils","configUrl":"smb://staging-opsi/opt_pcbin/pcpatch","utilsDrive":"P:","opsiServer":"staging-opsi.build.mozilla.org","nextBootServerType":"service","depotUrl":"smb://staging-opsi/opt_pcbin/install","depotId":"staging-opsi.build.mozilla.org","configDrive":"O:","winDomain":"buildnet"}}"
12/20/2010 12:36:42 PM depoturl for client talos-r3-xp-001.build.mozilla.org from opsi service: smb://staging-opsi/opt_pcbin/install
12/20/2010 12:36:42 PM utilsurl for client talos-r3-xp-001.build.mozilla.org from opsi service: smb://staging-opsi/opt_pcbin/utils
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getPcpatchPassword%22,%22params%22:%5B%22talos-r3-xp-001.build.mozilla.org%22%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":"19d9ee3761ec4f746981e62dcad55a0e"}"
12/20/2010 12:36:42 PM fetched encryptedpass from opsi service
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getGeneralConfig%5Fhash%22,%22params%22:%5B%22talos-r3-xp-001.build.mozilla.org%22%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":{"pcptchLabel1":"opsi","pcptchLabel2":"uib","button_stopnetworking":"","pcptchBitmap1":"winst1.bmp","pcptchBitmap2":"winst2.bmp","secsUntilConnectionTimeOut":"180"}}"
12/20/2010 12:36:42 PM ipNameHost: /* not found */
12/20/2010 12:36:42 PM depoturl: smb://staging-opsi/opt_pcbin/install
12/20/2010 12:36:42 PM configurl: smb://staging-opsi/opt_pcbin/install
12/20/2010 12:36:42 PM utilsurl: smb://staging-opsi/opt_pcbin/utils
12/20/2010 12:36:42 PM depot drive: P:
12/20/2010 12:36:42 PM config drive: P:
12/20/2010 12:36:42 PM utils drive: P:
12/20/2010 12:36:42 PM depotshare \\staging-opsi\opt_pcbin
12/20/2010 12:36:42 PM depotdir \install
12/20/2010 12:36:42 PM configshare \\staging-opsi\opt_pcbin
12/20/2010 12:36:42 PM configdir \install
12/20/2010 12:36:42 PM utilsshare \\staging-opsi\opt_pcbin
12/20/2010 12:36:42 PM utilsdir \utils
12/20/2010 12:36:42 PM winstMasterDirectory P:\install\opsi-winst\files\opsi-winst
12/20/2010 12:36:42 PM Profildateienpfad P:\install\
12/20/2010 12:36:42 PM Bootmode BKSTD
12/20/2010 12:36:42 PM trying to connect remote resource "\\staging-opsi\opt_pcbin" to local resource "P:"
12/20/2010 12:37:03 PM Fehler 53 ("The network path was not found")
12/20/2010 12:43:20 PM try_network_connect set to false. Reached time out 180 secs
12/20/2010 12:43:20 PM user of the process: pcpatch
12/20/2010 12:43:20 PM ending pcptch
12/20/2010 12:43:20 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22authenticated%22,%22params%22:%5B%5D%7D"
12/20/2010 12:43:20 PM json general Result "{"error":null,"id":1,"result":true}"
12/20/2010 12:43:20 PM Initiating log off
12/20/2010 12:43:20 PM WinstRegRebootVar 0
12/20/2010 12:43:20 PM WinstRegFinalShutdownVar 0
12/20/2010 12:43:20 PM According to registry key in HKLM\SOFTWARE\opsi.org\winst, Variable RebootRequested resp. ShutdownRequested: no shutdown and no logoff
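For anyone triaging more slaves, here is a minimal Python sketch that pulls the registry-sourced values and any "Fehler" codes out of a logonlog.txt. It assumes the log has one entry per line in the format shown in this comment; the function name summarize_logonlog and both regular expressions are illustrative assumptions, not part of OPSI.

```python
import re

# Matches entries such as:
#   ... retrieved from registry \SOFTWARE\opsi.org\shareinfo, depotdrive: P:
REGISTRY_RE = re.compile(
    r"retrieved from registry (?P<key>\S+), (?P<name>\w+): (?P<value>.*)$"
)
# Matches entries such as:
#   ... Fehler 53 ("The network path was not found")
ERROR_RE = re.compile(r'Fehler (?P<code>\d+) \("(?P<message>[^"]*)"\)')


def summarize_logonlog(lines):
    """Collect registry values and Fehler codes from logonlog.txt lines.

    Returns (registry, errors): registry maps (key, value_name) to the
    logged value; errors is a list of (code, message) in log order.
    """
    registry = {}
    errors = []
    for line in lines:
        line = line.rstrip("\r\n")
        match = REGISTRY_RE.search(line)
        if match:
            registry[(match.group("key"), match.group("name"))] = match.group("value")
            continue
        match = ERROR_RE.search(line)
        if match:
            errors.append((int(match.group("code")), match.group("message")))
    return registry, errors
```

Feeding it the lines of C:\tmp\logonlog.txt from an affected slave should surface the drive letters under \SOFTWARE\opsi.org\shareinfo and the error 53 entry without reading the whole log by eye. The file's text encoding is not known from this bug, so opening it may need an explicit encoding.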
Summary: Re-imaged staging slave does not point to the right network drive → Re-imaged staging slave does not point to the right OPSI network drive
I will deal with this once I come back in January.
Assignee: nobody → armenzg
Priority: -- → P3
Summary: Re-imaged staging slave does not point to the right OPSI network drive → Re-imaged talos-r3-xp-00{1,2} staging slaves do not point to the right OPSI network drive
Putting it into the queue in case anyone wants to debug it next week.
Assignee: armenzg → nobody
Priority: P3 → --
Summary: Re-imaged talos-r3-xp-00{1,2} staging slaves do not point to the right OPSI network drive → Re-imaged talos-r3-xp-00{1,2} staging slaves want to access the P drive rather than the Z drive for OPSI
Whiteboard: [opsi] → [opsi][buildduty]
I think this is also causing the issue in bug 611923.
talos-r3-xp-020 suffers from the same illness.
Summary: Re-imaged talos-r3-xp-00{1,2} staging slaves want to access the P drive rather than the Z drive for OPSI → talos-r3-xp-0{01,02,20} have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI)
List of slaves hitting the Fehler 53 problem, which therefore have older passwords (meaning they cannot receive OPSI packages):
* talos-r3-xp-001
* talos-r3-xp-002
* talos-r3-xp-005
* talos-r3-xp-007
* talos-r3-xp-009
* talos-r3-xp-013
* talos-r3-xp-014
* talos-r3-xp-015
* talos-r3-xp-018
* talos-r3-xp-019
* talos-r3-xp-020
* talos-r3-xp-032
* talos-r3-xp-033
* talos-r3-xp-034
* talos-r3-xp-035
* talos-r3-xp-036
* talos-r3-xp-037
* talos-r3-xp-038
* talos-r3-xp-039
* talos-r3-xp-040
Severity: normal → major
OS: Mac OS X → Windows XP
Summary: talos-r3-xp-0{01,02,20} have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI) → 20 talos-r3-xp slaves have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI)
<grumpy nthomas> It just isn't acceptable that we have machines that are 1) not in sync with OPSI 2) taking an additional 7 minutes to do a reboot Is this a recently discovered issue, or did something change to cause it ? What is the action plan to resolve this issue ASAP ?
How difficult is it to 1) disable OPSI and 2) install the package by hand via VNC? I think that the long-term fix is to use Puppet instead. But that requires a lot of preparatory work, so it doesn't qualify as "ASAP".
The word on the street is that Z is where we installed opsi from originally, but it hasn't been used since, so we don't have to convince opsi to use it now.

I compared c:\tmp\logonlog.txt for talos-r3-xp-039 (not up to date) with talos-r3-xp-041 (up to date) and the difference is

# in the ipconfig/all output
-Primary Dns Suffix . . . . . . . :
+Primary Dns Suffix . . . . . . . : build.mozilla.org
-DNS Suffix Search List. . . . . . : build.scl1.mozilla.com build.mozilla.org
+DNS Suffix Search List. . . . . . : build.mozilla.org
+ build.scl1.mozilla.com build.mozilla.org
+ mozilla.org
# later
-ipName from system: talos-r3-xp-039
+ipName from system: talos-r3-xp-041.build.mozilla.org
# trying to get the mount
trying to connect remote resource "\\production-opsi\opt_pcbin" to local resource "P:"
- Fehler 53 ("The network path was not found")
-try_network_connect set to false. Reached time out 180 secs
+ Fehler 127 ("The specified procedure could not be found")
+Netzverbindung von P: zu \\production-opsi\opt_pcbin hergestellt
+user of the process: pcpatch
+start working based on the network connection
+Local winst exists and seems to be up to date

Netzv... translates to 'Network connection from P: to \\production-opsi\opt_pcbin established'.

So it looks like the network setup is different on some of these minis. I'm going to bet there was some manual setup involved after they moved to SCL, and that not all these machines are set up the same. The intersection of comment #6 and bug 611441 comment #1 is pretty strong. We need to investigate whether it's a DHCP issue or something set on each slave.
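The comparison above can be repeated for other slave pairs with a short script. This is a minimal Python sketch using the standard difflib module; it assumes every log entry starts with a timestamp in the format seen in this bug, and the names normalize and diff_logs are hypothetical helpers made up for illustration.

```python
import difflib
import re

# Leading timestamp as seen in logonlog.txt, e.g. "12/20/2010 12:36:41 PM ".
TIMESTAMP_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M ")


def normalize(lines):
    """Strip timestamps and line endings so only content differences remain."""
    return [TIMESTAMP_RE.sub("", line.rstrip("\r\n")) for line in lines]


def diff_logs(broken_lines, working_lines):
    """Return only the '-' and '+' lines of a zero-context unified diff."""
    diff = list(
        difflib.unified_diff(
            normalize(broken_lines),
            normalize(working_lines),
            fromfile="broken",
            tofile="working",
            lineterm="",
            n=0,
        )
    )
    # The first two entries are the '---'/'+++' file headers; '@@' lines
    # are hunk markers. Everything else is an added or removed line.
    return [entry for entry in diff[2:] if not entry.startswith("@@")]
```

Hostnames, MAC addresses and opsi key info will still show up as expected differences, so the output needs the same eyeballing as the hand-made diff above; the sketch only removes the timestamp noise.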
Summary: 20 talos-r3-xp slaves have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI) → 20 talos-r3-xp slaves have older passwords - network setup issue
talos-r3-xp-039:
----------------
C:\Documents and Settings\cltbld>nslookup production-opsi
Server: ns1.infra.scl1.mozilla.com
Address: 10.12.75.10

*** ns1.infra.scl1.mozilla.com can't find production-opsi: Non-existent domain

talos-r3-xp-041:
----------------
C:\Documents and Settings\cltbld>nslookup production-opsi
Server: ns1.infra.scl1.mozilla.com
Address: 10.12.75.10

Non-authoritative answer:
Name: production-opsi.build.sjc1.mozilla.com
Address: 10.2.71.64
Aliases: production-opsi.build.mozilla.org

So I set the Primary DNS suffix on xp-039 (steps in next comment). On reboot it loaded OPSI quickly and updated the password. production-opsi now resolves, obviously. The only deltas from xp-041's logonlog.txt are the things you'd expect (timestamps, hostname, MAC address, opsi key info). The 'DNS Suffix Search List's match.
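The same check can be scripted. This is a minimal Python sketch, assuming Python is available on the slave; resolves and check_opsi_lookup are hypothetical helper names. socket.gethostbyname goes through the operating system's resolver, which may also consult sources other than DNS, so the result is not guaranteed to match nslookup exactly.

```python
import socket


def resolves(hostname, resolver=socket.gethostbyname):
    """Return the resolved IPv4 address, or None if the lookup fails."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return None


def check_opsi_lookup(short_name="production-opsi",
                      suffix="build.mozilla.org",
                      resolver=socket.gethostbyname):
    """Look up both the short name and the fully qualified name.

    A slave with a missing Primary DNS suffix would be expected to fail
    on the short name while still resolving the qualified one.
    """
    qualified = f"{short_name}.{suffix}"
    return {
        short_name: resolves(short_name, resolver),
        qualified: resolves(qualified, resolver),
    }
```

The resolver parameter exists only so the function can be exercised without network access; on a real slave the default is the one to use.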
Action item: For each host that needs updating, do
* Connect with VNC while the slave is idle
* Start menu
* Right click on My Computer
* Properties option
* Computer Name tab
* Change button
* More button
* Set the Primary DNS suffix field to 'build.mozilla.org'
* Confirm NetBIOS name is still talos-r3-xp-XXX
* OK buttons until you can restart
All hosts in comment 6 added to the slave spreadsheet. It looks like a working system can be detected by
* nslookup production-opsi works
* P: mounted
* changed password after reboot
P won't be visible as a mount. It's mounted by OPSI at boot time, and then unmounted again.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Update:
* talos-r3-xp-00[1-3] are waiting on re-imaging (added relevant bugs to dependencies)
* talos-r3-xp-019 has had its suffix updated, but didn't shut down cleanly. Added to the reboots bug (https://bugzilla.mozilla.org/show_bug.cgi?id=620948#c69)
* all other slaves updated successfully from OPSI after following the steps in comment #11
Depends on: 627121, 627120, 624383
No longer depends on: 624383, 627120, 627121
Updated update after looking at the right slaves (ahem):
* talos-r3-xp-001 also needs a reboot (https://bugzilla.mozilla.org/show_bug.cgi?id=620948#c70)
* talos-r3-xp-002 needs re-imaging (bug 628037)
* talos-r3-xp-003 is successfully updated

So, we're still waiting on
* talos-r3-xp-001: reboot
* talos-r3-xp-002: re-image
* talos-r3-xp-019: reboot
Depends on: 628037, 620948
talos-r3-xp-001 and talos-r3-xp-019 have been resurrected.
No longer depends on: 620948
Just fixed talos-r3-xp-002, so we're done here.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Fantastic! Thanks coop.
Very awesome!
Product: mozilla.org → Release Engineering