Closed
Bug 620517
Opened 14 years ago
Closed 14 years ago
20 talos-r3-xp slaves have older passwords - network setup issue
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: coop)
References
Details
(Whiteboard: [opsi][buildduty])
NOTE: Much of the information in this bug is now documented in the following wiki section, which still needs review: https://wiki.mozilla.org/ReleaseEngineering/OPSI#Troubleshooting
After a slave is re-imaged, we have to regenerate the OPSI hostkey on that slave (https://wiki.mozilla.org/ReleaseEngineering/OPSI#Wrong_hostkey).
I matched the hostkey on the slave (talos-r3-xp-001) to what is on the master which seems to do the trick.
After that, even though the OPSI master and the OPSI client can talk to each other, the client tries to reach the wrong drive (P: instead of Z:). The P: drive setting is loaded from the registry.
Another way to notice this problem is that the prelogin screen takes a very long time (around 7 minutes) before it logs into Windows. (IMO it should either not log in at all or prevent buildbot from starting, but that is a separate issue.)
You can also notice that it says "Mit Netzlaufwerken verbinden, bitte noch etwas warten", which means "Connecting to network drives, please wait a bit", with a status at the bottom right that reads "Status: 53". The same code appears in the log below ("Fehler" is German for "error").
> 12/20/2010 12:36:42 PM winstMasterDirectory P:\install\opsi-winst\files\opsi-winst
> 12/20/2010 12:36:42 PM Profildateienpfad P:\install\
> 12/20/2010 12:36:42 PM Bootmode BKSTD
> 12/20/2010 12:36:42 PM trying to connect remote resource "\\staging-opsi\opt_pcbin" to local resource "P:"
> 12/20/2010 12:37:03 PM Fehler 53 ("The network path was not found")
> 12/20/2010 12:43:20 PM try_network_connect set to false. Reached time out 180 secs
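A minimal sketch of how the failure signature in this excerpt could be spotted automatically. It assumes the timestamp format shown in the log and treats the `try_network_connect set to false` line as the failure marker, since an error line on its own may not mean the mount failed. The name `find_mount_failure` is hypothetical and not part of OPSI.

```python
import re
from datetime import datetime

# Matches lines such as:
# 12/20/2010 12:37:03 PM Fehler 53 ("The network path was not found")
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+(.*)$")
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"
ERROR_RE = re.compile(r"Fehler (\d+)")


def find_mount_failure(lines):
    """Return (error_code, seconds_stalled) for a timed-out drive mount, else None."""
    attempt_time = None
    error_code = None
    for raw in lines:
        # Tolerate quoted excerpts that start with "> ".
        match = LINE_RE.match(raw.strip().lstrip("> "))
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), TIME_FORMAT)
        message = match.group(2)
        if message.startswith("trying to connect remote resource"):
            attempt_time, error_code = stamp, None
            continue
        if attempt_time is None:
            continue
        error = ERROR_RE.search(message)
        if error and error_code is None:
            error_code = int(error.group(1))
        elif "try_network_connect set to false" in message:
            return error_code, int((stamp - attempt_time).total_seconds())
    return None
```

For the six lines quoted above this returns `(53, 398)`: the error code, plus about six and a half minutes between the mount attempt and the timeout, which is consistent with the slow login described.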
Switching the drive to "Z:" in the registry and rebooting the machine does *not* seem to fix the issue.
After rebooting, the registry says "P:" again.
Could these values be loaded from the samba staging-opsi mount?
I can see the "opt_pcbin on 'Samba 3.0.24 (10.2.71.216)' (Z:)" mount on the machine.
I will try tomorrow to mount on P instead of Z and/or re-install once more (I did re-install from staging-opsi - 10.2.71.216).
The drive can be checked in the registry under HKLM\SOFTWARE\opsi.org\shareinfo:
* configdrive
* depotdrive
* utilsdrive
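As a rough sketch, the three values could be read with Python's standard `winreg` module instead of opening the registry editor. The function names are hypothetical, the default reader only works on Windows, and it assumes Python 3 (on Python 2 the module is called `_winreg`). The reader is injectable so the logic can be exercised elsewhere.

```python
SHAREINFO_KEY = r"SOFTWARE\opsi.org\shareinfo"
DRIVE_VALUES = ("configdrive", "depotdrive", "utilsdrive")


def read_from_registry(name):
    """Read one value from HKLM\\SOFTWARE\\opsi.org\\shareinfo (Windows only)."""
    import winreg  # imported lazily so the module loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SHAREINFO_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, name)
    return value


def opsi_drive_letters(read_value=read_from_registry):
    """Return a dict mapping each drive setting to its current letter."""
    return {name: read_value(name) for name in DRIVE_VALUES}
```

On an affected slave, `opsi_drive_letters()` would be expected to report "P:" for all three entries, matching the log.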
I spotted the problem by reading C:\tmp\logonlog.txt
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, pckey: DONT SHOW IT
12/20/2010 12:36:41 PM reading pckey from file "C:\Program Files\opsi.org\preloginloader\cfg\locked.cfg"
12/20/2010 12:36:41 PM pckey read from file
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\general, tftpserver: staging-opsi.build.mozilla.org
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\general, configlocal: 0
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, user: pcpatch
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, smbusername1: opsiserver\pcpatch
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, try_secondary_user: 0
...
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, depoturl: smb:\\staging-opsi\opt_pcbin\install
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, configurl: smb:\\staging-opsi\opt_pcbin\install
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, utilsurl: smb:\\staging-opsi\opt_pcbin\utils
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, utilsdrive: P:
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, configdrive: P:
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\shareinfo, depotdrive: P:
12/20/2010 12:36:41 PM Error: Requested bitmap could not be loaded
12/20/2010 12:36:41 PM retrieved from registry \SOFTWARE\opsi.org\pcptch, button_stopnetworking: immediate
12/20/2010 12:36:41 PM Button StopNetworking enabled
12/20/2010 12:36:41 PM servicehost reached in 0 ms
12/20/2010 12:36:41 PM determining opsi client ID
12/20/2010 12:36:41 PM opsi service with URL https://10.2.71.216:4447
12/20/2010 12:36:41 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getHostId%22,%22params%22:%5B%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":""}"
12/20/2010 12:36:42 PM No client ID got from service
12/20/2010 12:36:42 PM Try with ipname from local system: talos-r3-xp-001
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getDomain%22,%22params%22:%5B%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":"build.mozilla.org"}"
12/20/2010 12:36:42 PM Default domain from service >build.mozilla.org<
12/20/2010 12:36:42 PM we supplement default domain from service to name: talos-r3-xp-001
12/20/2010 12:36:42 PM We have client ID talos-r3-xp-001.build.mozilla.org
12/20/2010 12:36:42 PM opsi service with URL https://10.2.71.216:4447
12/20/2010 12:36:42 PM opsi service "https://10.2.71.216:4447", client "talos-r3-xp-001.build.mozilla.org" , username "talos-r3-xp-001.build.mozilla.org"
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22authenticated%22,%22params%22:%5B%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":true}"
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getNetworkConfig%5Fhash%22,%22params%22:%5B%22talos-r3-xp-001.build.mozilla.org%22%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":{"depotDrive":"P:","nextBootServiceURL":"https://10.2.71.216:4447","utilsUrl":"smb://staging-opsi/opt_pcbin/utils","configUrl":"smb://staging-opsi/opt_pcbin/pcpatch","utilsDrive":"P:","opsiServer":"staging-opsi.build.mozilla.org","nextBootServerType":"service","depotUrl":"smb://staging-opsi/opt_pcbin/install","depotId":"staging-opsi.build.mozilla.org","configDrive":"O:","winDomain":"buildnet"}}"
12/20/2010 12:36:42 PM depoturl for client talos-r3-xp-001.build.mozilla.org from opsi service: smb://staging-opsi/opt_pcbin/install
12/20/2010 12:36:42 PM utilsurl for client talos-r3-xp-001.build.mozilla.org from opsi service: smb://staging-opsi/opt_pcbin/utils
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getPcpatchPassword%22,%22params%22:%5B%22talos-r3-xp-001.build.mozilla.org%22%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":"19d9ee3761ec4f746981e62dcad55a0e"}"
12/20/2010 12:36:42 PM fetched encryptedpass from opsi service
12/20/2010 12:36:42 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22getGeneralConfig%5Fhash%22,%22params%22:%5B%22talos-r3-xp-001.build.mozilla.org%22%5D%7D"
12/20/2010 12:36:42 PM json general Result "{"error":null,"id":1,"result":{"pcptchLabel1":"opsi","pcptchLabel2":"uib","button_stopnetworking":"","pcptchBitmap1":"winst1.bmp","pcptchBitmap2":"winst2.bmp","secsUntilConnectionTimeOut":"180"}}"
12/20/2010 12:36:42 PM ipNameHost: /* not found */
12/20/2010 12:36:42 PM depoturl: smb://staging-opsi/opt_pcbin/install
12/20/2010 12:36:42 PM configurl: smb://staging-opsi/opt_pcbin/install
12/20/2010 12:36:42 PM utilsurl: smb://staging-opsi/opt_pcbin/utils
12/20/2010 12:36:42 PM depot drive: P:
12/20/2010 12:36:42 PM config drive: P:
12/20/2010 12:36:42 PM utils drive: P:
12/20/2010 12:36:42 PM depotshare \\staging-opsi\opt_pcbin
12/20/2010 12:36:42 PM depotdir \install
12/20/2010 12:36:42 PM configshare \\staging-opsi\opt_pcbin
12/20/2010 12:36:42 PM configdir \install
12/20/2010 12:36:42 PM utilsshare \\staging-opsi\opt_pcbin
12/20/2010 12:36:42 PM utilsdir \utils
12/20/2010 12:36:42 PM winstMasterDirectory P:\install\opsi-winst\files\opsi-winst
12/20/2010 12:36:42 PM Profildateienpfad P:\install\
12/20/2010 12:36:42 PM Bootmode BKSTD
12/20/2010 12:36:42 PM trying to connect remote resource "\\staging-opsi\opt_pcbin" to local resource "P:"
12/20/2010 12:37:03 PM Fehler 53 ("The network path was not found")
12/20/2010 12:43:20 PM try_network_connect set to false. Reached time out 180 secs
12/20/2010 12:43:20 PM user of the process: pcpatch
12/20/2010 12:43:20 PM ending pcptch
12/20/2010 12:43:20 PM json call: "https://10.2.71.216:4447/rpc?%7B%22id%22:1,%22method%22:%22authenticated%22,%22params%22:%5B%5D%7D"
12/20/2010 12:43:20 PM json general Result "{"error":null,"id":1,"result":true}"
12/20/2010 12:43:20 PM Initiating log off
12/20/2010 12:43:20 PM WinstRegRebootVar 0
12/20/2010 12:43:20 PM WinstRegFinalShutdownVar 0
12/20/2010 12:43:20 PM According to registry key in HKLM\SOFTWARE\opsi.org\winst, Variable RebootRequested resp. ShutdownRequested: no shutdown and no logoff
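The `json call` lines in the log are percent-encoded, which makes them hard to read. A small sketch, using only the Python standard library, of decoding one into its method name and parameters; the helper name `decode_opsi_call` is hypothetical.

```python
import json
from urllib.parse import unquote, urlsplit


def decode_opsi_call(url):
    """Return (method, params) from a logged opsi JSON-RPC URL."""
    query = urlsplit(url).query       # everything after the "?"
    payload = json.loads(unquote(query))
    return payload["method"], payload["params"]
```

Applied to the `getNetworkConfig%5Fhash` call above, this yields `("getNetworkConfig_hash", ["talos-r3-xp-001.build.mozilla.org"])`.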
Reporter
Updated•14 years ago
Summary: Re-imaged staging slave does not point to the right network drive → Re-imaged staging slave does not point to the right OPSI network drive
Reporter
Comment 1•14 years ago
I will deal with this when I come back in January.
Assignee: nobody → armenzg
Priority: -- → P3
Summary: Re-imaged staging slave does not point to the right OPSI network drive → Re-imaged talos-r3-xp-00{1,2} staging slaves do not point to the right OPSI network drive
Reporter
Comment 2•14 years ago
Putting it into the queue in case anyone wants to debug it next week.
Assignee: armenzg → nobody
Priority: P3 → --
Summary: Re-imaged talos-r3-xp-00{1,2} staging slaves do not point to the right OPSI network drive → Re-imaged talos-r3-xp-00{1,2} staging slaves want to access the P drive rather than the Z drive for OPSI
Whiteboard: [opsi] → [opsi][buildduty]
Assignee
Comment 3•14 years ago
I think this is also causing the issue in bug 611923.
Reporter
Comment 5•14 years ago
talos-r3-xp-020 suffers from the same problem.
Summary: Re-imaged talos-r3-xp-00{1,2} staging slaves want to access the P drive rather than the Z drive for OPSI → talos-r3-xp-0{01,02,20} have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI)
Reporter
Comment 6•14 years ago
List of slaves that hit the Fehler 53 problem and therefore still have older passwords (which means they cannot receive OPSI packages):
* talos-r3-xp-001
* talos-r3-xp-002
* talos-r3-xp-005
* talos-r3-xp-007
* talos-r3-xp-009
* talos-r3-xp-013
* talos-r3-xp-014
* talos-r3-xp-015
* talos-r3-xp-018
* talos-r3-xp-019
* talos-r3-xp-020
* talos-r3-xp-032
* talos-r3-xp-033
* talos-r3-xp-034
* talos-r3-xp-035
* talos-r3-xp-036
* talos-r3-xp-037
* talos-r3-xp-038
* talos-r3-xp-039
* talos-r3-xp-040
Severity: normal → major
OS: Mac OS X → Windows XP
Summary: talos-r3-xp-0{01,02,20} have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI) → 20 talos-r3-xp slaves have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI)
Comment 7•14 years ago
<grumpy nthomas> It just isn't acceptable that we have machines that are
1) not in sync with OPSI
2) taking an additional 7 minutes to do a reboot
Is this a recently discovered issue, or did something change to cause it? What is the action plan to resolve this issue ASAP?
Comment 8•14 years ago
How difficult is it to
1) disable OPSI
2) install the package by hand via VNC?
I think that the long-term fix is to use Puppet instead. But that requires a lot of preparatory work, so it doesn't qualify as "ASAP".
Comment 9•14 years ago
The word on the street is that Z is where we originally installed OPSI from, but it hasn't been used since, so we don't have to convince OPSI to use it now.
I compared c:\tmp\logonlog.txt for talos-r3-xp-039 (not up to date) with talos-r3-xp-041 (up to date), and the differences are:
# in the ipconfig/all output
-Primary Dns Suffix . . . . . . . :
+Primary Dns Suffix . . . . . . . : build.mozilla.org
-DNS Suffix Search List. . . . . . : build.scl1.mozilla.com build.mozilla.org
+DNS Suffix Search List. . . . . . : build.mozilla.org
+ build.scl1.mozilla.com build.mozilla.org
+ mozilla.org
# later
-ipName from system: talos-r3-xp-039
+ipName from system: talos-r3-xp-041.build.mozilla.org
# trying to get the mount
trying to connect remote resource "\\production-opsi\opt_pcbin" to local resource "P:"
- Fehler 53 ("The network path was not found")
-try_network_connect set to false. Reached time out 180 secs
+ Fehler 127 ("The specified procedure could not be found")
+Netzverbindung von P: zu \\production-opsi\opt_pcbin hergestellt
+user of the process: pcpatch
+start working based on the network connection
+Local winst exists and seems to be up to date
Netzv... translates to 'Network connection from P: to \\production-opsi\opt_pcbin established'.
So it looks like the network setup is different on some of these minis. I'm going to bet there was some manual setup involved after they moved to SCL, and that not all these machines are set up the same. The intersection of comment #6 and bug 611441 comment #1 is pretty strong.
We need to investigate if it's a DHCP issue or something set on each slave.
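A comparison like the one in this comment could be reproduced roughly as follows. This is only a sketch: it assumes both logs use the timestamp prefix seen earlier in this bug, strips that prefix so that identical steps line up, and uses `difflib.ndiff` from the standard library. The function names are hypothetical.

```python
import difflib
import re

# Timestamp prefix such as "12/20/2010 12:36:42 PM "
TIMESTAMP_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M\s+")


def strip_timestamps(lines):
    """Drop the leading timestamp and trailing newline from each log line."""
    return [TIMESTAMP_RE.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(broken_lines, working_lines):
    """Return only the lines that differ, prefixed with '- ' or '+ '."""
    delta = difflib.ndiff(strip_timestamps(broken_lines),
                          strip_timestamps(working_lines))
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; skip those.
    return [line for line in delta if line.startswith(("- ", "+ "))]
```

Lines that legitimately differ between hosts (hostname, MAC address, key info) will still show up and need to be filtered by eye.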
Updated•14 years ago
Summary: 20 talos-r3-xp slaves have older passwords - Fehler 53 ("The network path was not found") - (slaves want to access the P drive rather than the Z drive for OPSI) → 20 talos-r3-xp slaves have older passwords - network setup issue
Comment 10•14 years ago
talos-r3-xp-039:
----------------
C:\Documents and Settings\cltbld>nslookup production-opsi
Server: ns1.infra.scl1.mozilla.com
Address: 10.12.75.10
*** ns1.infra.scl1.mozilla.com can't find production-opsi: Non-existent domain
talos-r3-xp-041:
----------------
C:\Documents and Settings\cltbld>nslookup production-opsi
Server: ns1.infra.scl1.mozilla.com
Address: 10.12.75.10
Non-authoritative answer:
Name: production-opsi.build.sjc1.mozilla.com
Address: 10.2.71.64
Aliases: production-opsi.build.mozilla.org
So I set the Primary DNS suffix on xp-039 (steps in next comment). On reboot it loaded OPSI quickly and updated the password. production-opsi resolves, obviously.
The only deltas from xp-041's logonlog.txt are the things you'd expect (timestamps, hostname, MAC address, OPSI key info). The 'DNS Suffix Search List' entries match.
Comment 11•14 years ago
Action item:
For each host that needs updating, do
* Connect with VNC while the slave is idle
* Start menu
* Right click on My Computer
* Properties option
* Computer Name tab
* Change button
* More button
* Set the Primary DNS suffix field to 'build.mozilla.org'
* Confirm NetBIOS name is still talos-r3-xp-XXX
* OK buttons until you can restart
Comment 12•14 years ago
All hosts in comment 6 added to the slave spreadsheet.
It looks like a working system can be detected by
* nslookup production-opsi works
* P: mounted
* changed password after reboot
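The first check in this list could be scripted along these lines. This is a sketch only: it uses `socket.gethostbyname`, which goes through the operating system's resolver and so may not behave exactly like `nslookup`; the helper name `resolves` is hypothetical, and the resolver is injectable so the check can be tried without network access.

```python
import socket


def resolves(hostname, resolver=socket.gethostbyname):
    """Return the address the short name resolves to, or None if lookup fails."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return None
```

On a slave, `resolves("production-opsi")` returning `None` would suggest the host is still affected by the missing DNS suffix.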
Comment 13•14 years ago
P won't be visible as a mount. It's mounted by OPSI at boot time, and then unmounted again.
Assignee
Updated•14 years ago
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Assignee
Comment 14•14 years ago
Update:
* talos-r3-xp-00[1-3] are waiting on re-imaging (added relevant bugs to dependencies)
* talos-r3-xp-019 has had its suffix updated, but didn't shut down cleanly. Added to the reboots bug (https://bugzilla.mozilla.org/show_bug.cgi?id=620948#c69)
* all other slaves updated successfully from OPSI after following the steps in comment #11
Assignee
Comment 15•14 years ago
Updated update after looking at the right slaves (ahem):
* talos-r3-xp-001 also needs a reboot (https://bugzilla.mozilla.org/show_bug.cgi?id=620948#c70)
* talos-r3-xp-002 needs re-imaging (bug 628037)
* talos-r3-xp-003 is successfully updated
So, we're still waiting on
* talos-r3-xp-001: reboot
* talos-r3-xp-002: re-image
* talos-r3-xp-019: reboot
Assignee
Comment 16•14 years ago
talos-r3-xp-001 and talos-r3-xp-019 have been resurrected.
No longer depends on: 620948
Assignee
Comment 17•14 years ago
Just fixed talos-r3-xp-002, so we're done here.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 18•14 years ago
Fantastic! Thanks coop.
Reporter
Comment 19•14 years ago
Very awesome!
Updated•12 years ago
Product: mozilla.org → Release Engineering