Closed Bug 788382 Opened 13 years ago Closed 13 years ago

New WinXP slaves intermittently taking 7 hours to wind up disconnected

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86
Windows XP

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: armenzg)

References

Details

(Whiteboard: [capacity][buildduty])

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=14960453&tree=Mozilla-Inbound talos-r3-xp-087 TEST-START | chrome://mochitests/content/browser/browser/components/tabview/test/browser_tabview_bug662266.js remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ========= Finished 'python mochitest/runtests.py ...' interrupted (results: 4, elapsed: 7 hrs, 9 mins, 7 secs) (at 2012-09-04 18:39:25.727862) ========= https://tbpl.mozilla.org/php/getParsedLog.php?id=14954485&tree=Mozilla-Inbound talos-r3-xp-090 TEST-INFO | chrome://mochitests/content/browser/dom/indexedDB/test/browser_quotaPromptDelete.js | dispatching event: indexedDB-addMore remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ========= Finished 'python mochitest/runtests.py ...' interrupted (results: 4, elapsed: 7 hrs, 36 mins, 54 secs) (at 2012-09-04 15:10:15.435786) ========= https://tbpl.mozilla.org/php/getParsedLog.php?id=14953423&tree=Mozilla-Inbound talos-r3-xp-083 TEST-INFO | C:\talos-slave\test\build\xpcshell\tests\toolkit\components\places\tests\unit\test_415757.js | running test ... remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ========= Finished 'bash -c ...' interrupted (results: 4, elapsed: 7 hrs, 7 mins, 13 secs) (at 2012-09-04 14:35:08.227435) =========
Whiteboard: [capacity] → [capacity][buildduty]
Assignee: nobody → armenzg
FTR the slaves were re-imaged on bug 786036.
Depends on: 786036
It seems that the slaves were added to OPSI manually rather than through production-slaves file. This is why some OPSI packages are missing. Slaves 76 to 95 seem to only have these packages installed on OPSI: * twisted_dumbwin32proc * vc2005_debugcrt_x86 * vc2005sp1_redist * vc2010_debugcrt_x86 * winxp-graphic-driver I have deleted manually xp-084 from OPSI and added it to production-slaves. The cronjob has added the slave back based on the talos-r3-xp-ref entry/template in OPSI. This slave now shows the same OPSI packages as other production slaves. Comparing two slaves reveals the different packages (http://cl.ly/JJLJ): * cleanup * nagios_config * passwordupdate * preloginloader * python-2.7 * remove-psskill * time-server * toolsrepo * twisted_dumbwin32proc ** to my surprise in prod slaves this package is marked as "not_installed" but the actual ref image provides the replacement file rather than the OPSI package I will provide a patch for adjusting the slaves but I can't promise that all problems will be resolved. Some of these slaves are running on staging and I hope I can reproduce the issues. TODO: document this stuff properly I also moved the log because it was huge: production-opsi:~# cp new-slaves.log new-slaves.log.20120907 production-opsi:~# rm new-slaves.log && touch new-slaves.log [1] http://hg.mozilla.org/build/opsi-package-sources/file/default/look-for-new-slaves.py#l82
Attached patch adjust list of slaves in OPSI — — Splinter Review
Attachment #659338 - Flags: review?(coop)
Maybe this is the cause? https://bugzilla.mozilla.org/show_bug.cgi?id=548330 (In reply to Ben Hearsum [:bhearsum] from comment #0) > We used to 'pskill' to kill firefox.exe and other processes at the start of > a unittest run. With proper Twisted process management and reboots after > every build we no longer need it. Additionally, it often causes problems on > new slaves because it requires a license to be click-throughed. We should > stop running it in UnittestBuildFactory, and delete it from the slaves.
I disabled slaves 76 to 95 until we fix this.
Attachment #659338 - Flags: review?(coop) → review+
Blocks: 789580
(In reply to Armen Zambrano G. [:armenzg] from comment #2) > TODO: document this stuff properly Wow, no kidding. I consider myself reasonably well acquainted with OPSI, and I didn't even know production-slaves existed. Worse, it's not mentioned anywhere in our OPSI docs or even in the entire wiki. Also, why should it matter if the packages are marked as installed or not? Doesn't the base image for XP already contain all of the packages listed on the ref platform page? Given that our documentation for OPSI and ref platforms is at best out-of-date and at worst terrible, how are we to know what the real state of these machines is if we copy just copy the state of installed state of the ref platform packages? We should either use OPSI like puppet and actually install packages with it so we have a consistent, known state, or stop using OPSI all together and go strictly with reference images.
(In reply to Chris Cooper [:coop] from comment #7) > (In reply to Armen Zambrano G. [:armenzg] from comment #2) > > TODO: document this stuff properly > > Also, why should it matter if the packages are marked as installed or not? > It does not. > Doesn't the base image for XP already contain all of the packages listed on > the ref platform page? I don't know. It should. > Given that our documentation for OPSI and ref > platforms is at best out-of-date and at worst terrible, how are we to know > what the real state of these machines is if we copy just copy the state of > installed state of the ref platform packages? > I don't know. We can discuss it on Monday call. > > We should either use OPSI like puppet and actually install packages with it > so we have a consistent, known state, or stop using OPSI all together and go > strictly with reference images. > You now know to a deeper degree why we don't like OPSI.
I'm not sure if it was OPSI after all. I won't land the patch for now as I don't want to be distracted by it. I will get back to this unless someone else wants to pick it up. I need to deal with the metro nightly builds on elm and the win7 multinode evaluation. FTR I'm running on talos-r3-xp-084 manually and see what it yields: C:\mozilla-build\python25\python.exe mochitest/runtests.py --appname=firefox/firefox.exe --utility-path=bin --extra-profile-file=bin/plugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1346387570/firefox-18.0a1.en-US.win32.crashreporter-symbols.zip --total-chunks 5 --this-chunk 5 --chunk-by-dir 4 I'm also going to ask IT to reimage one of the slaves for me and set it up again.
Priority: -- → P3
Depends on: 790082
Next thing to try is to setup talos-r3-xp-082 again (re-imaged from bug 790082), sync it with OPSI and run it through staging to see if it has the same issue.
This seems to also affect older minis when they get re-imaged (see bug 751115).
Priority: P3 → P2
I have put talos-r3-xp-082 through staging after: 1) adjusting the hostname 2) removing the OPSI entry for it 3) rebooted it
No changes. I get the same: "ERROR: Logon failure: unknown user name or bad password." I will dig in the code: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py
To my surprise, I thought I had checked if _dumbwin32proc.py was the same as on other slaves. It seems I was wrong. I have marked the packaged to be installed and re-triggered a job. Let's hope is that, take the embarrassment and move on. I also noticed that on a successful job I see the following SUCCESS lines: SUCCESS: The process with PID 3116 has been terminated. SUCCESS: The process with PID 1900 has been terminated. SUCCESS: The process with PID 692 has been terminated. which come from these processes: INFO | automation.py | SSL tunnel pid: 3116 INFO | runtests.py | Server pid: 1900 INFO | runtests.py | Websocket server pid: 692
_dumbwin32proc.py does not seem to be involved. jmaher, jlebar, (I saw your names for code changes): I have modified the code to not run the automation (see below) and I still got the "ERROR: Logon failure: unknown user name or bad password." I have looked at the task manager and I can see the PIDs that get started: 1668 and 3264. I don't see the SSL tunel pid this time as I saw in comment 15. Probably because I commented a chunk of the code out. Any ideas? ######################### Code changes: self.automation.log.info("INFO | runtests.py | Running tests: start.\n") + ''' try: status = self.automation.runApp(testURL, browserEnv, options.app, options.profilePath, options.browserArgs, runSSLTunnel = self.runSSLTunnel, utilityPath = options.utilityPath, xrePath = options.xrePath, certPath=options.certPath, debuggerInfo=debuggerInfo, symbolsPath=options.symbolsPath, timeout = timeout) except KeyboardInterrupt: self.automation.log.info("INFO | runtests.py | Received keyboard interrupt .\n"); status = -1 except: self.automation.log.exception("INFO | runtests.py | Received unexpected ex ception while running application\n") status = 1 + ''' + status = 0 ######################### Here is the log I got: C:\talos-slave\test\build>C:\mozilla-build\python25\python.exe mochitest/runtest s.py --appname=firefox/firefox.exe --utility-path=bin --extra-profile-file=bin/p lugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ mozilla-central-win32/1346387570/firefox-18.0a1.en-US.win32.crashreporter-symbol s.zip --total-chunks 5 --this-chunk 2 --chunk-by-dir 4 INFO | runtests.py | Installing extension at C:\talos-slave\test\build\mochitest \extensions\specialpowers to c:\docume~1\cltbld\locals~1\temp\tmpkmypbj. INFO | runtests.py | Installing extension at C:\talos-slave\test\build\mochitest \extensions\worker to c:\docume~1\cltbld\locals~1\temp\tmpkmypbj. INFO | runtests.py | Installing extension at C:\talos-slave\test\build\mochitest \extensions\workerbootstrap to c:\docume~1\cltbld\locals~1\temp\tmpkmypbj. args: ['C:\\talos-slave\\test\\build\\bin\\xpcshell.exe', '-g', 'C:\\talos-slave \\test\\build\\firefox', '-v', '170', '-f', './httpd.js', '-e', "const _PROFILE_ PATH = 'c:\\\\docume~1\\\\cltbld\\\\locals~1\\\\temp\\\\tmpkmypbj';const _SERVER _PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined ;", '-f', './server.js'] INFO | runtests.py | Server pid: 1668 args: ['C:\\mozilla-build\\python25\\python.exe', 'C:\\talos-slave\\test\\build\ \mochitest\\pywebsocket_wrapper.py', '-p', '9988', '-w', 'C:\\talos-slave\\test\ \build\\mochitest', '-l', 'C:\\talos-slave\\test\\build\\mochitest\\websock.log' , '--log-level=debug', '--allow-handlers-outside-root-dir'] INFO | runtests.py | Websocket server pid: 3264 INFO | runtests.py | Running tests: start. ERROR: Logon failure: unknown user name or bad password. ERROR: Logon failure: unknown user name or bad password. WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks c an't be detected! INFO | runtests.py | Running tests: end.
The problem is right here: subprocess.Popen(["taskkill", "/F", "/PID", pid]).wait() I have tried also other ways of using taskkill manually: "taskill /S system /U cltbld /F /PID <number> but got: "ERROR: The RPC server is unavailable" http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#191 191 def kill(self): 192 if Automation().IS_WIN32: 193 import platform 194 pid = "%i" % self.pid 195 if platform.release() == "2000": 196 # Windows 2000 needs 'kill.exe' from the 197 #'Windows 2000 Resource Kit tools'. (See bug 475455.) 198 try: 199 subprocess.Popen(["kill", "-f", pid]).wait() 200 except: 201 self.log.info("TEST-UNEXPECTED-FAIL | automation.py | Missing 'kill' utility to kill process with pid=%s. Kill it manually!", pid) 202 else: 203 # Windows XP and later. 204 subprocess.Popen(["taskkill", "/F", "/PID", pid]).wait() 205 else: 206 os.kill(self.pid, signal.SIGKILL)
I can't run: tasklist /svc on any WinXP machine that gets re-imaged. I can do so in any older machine. I cannot restart the RPC services. I've searched for "taskkill rpc server is unavailable" but I have not yet found a solution that works. There is something very very fishy. MaRu, do you have any ideas?
Every time I try to run "tasklist /svc" I get a nasty error event: http://cl.ly/JiWi Event Type: Error Event Source: DCOM Event Category: None Event ID: 10000 Date: 9/25/2012 Time: 2:25:36 PM User: NT AUTHORITY\NETWORK SERVICE Computer: TALOS-R3-XP-082 Description: Unable to start a DCOM Server: {73E709EA-5D93-4B2E-BBB0-99B7938DA9E4}. The error: "Access is denied. " Happened while starting this command: C:\WINDOWS\system32\wbem\wmiprvse.exe -Embedding For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Reading back through this bug, based on comment 11, are you saying that older machines that are reimaged are also having this problem? That points to some change that you guys made to the ref machine before the last capture you asked us to do. We can roll back to using the previous image (which also has the older password) if that helps.
Depends on: 794248
(In reply to Amy Rich [:arich] [:arr] from comment #20) > Reading back through this bug, based on comment 11, are you saying that > older machines that are reimaged are also having this problem? That points > to some change that you guys made to the ref machine before the last capture > you asked us to do. We can roll back to using the previous image (which > also has the older password) if that helps. This is correct. It happens on older images. I have checked talos-r3-xp-ref and has no problems. Maybe we need a new snapshot? I have filed bug 794248 for now.
(In reply to Amy Rich [:arich] [:arr] from comment #20) > Reading back through this bug, based on comment 11, are you saying that > older machines that are reimaged are also having this problem? That points > to some change that you guys made to the ref machine before the last capture > you asked us to do. We can roll back to using the previous image (which > also has the older password) if that helps. Actually, from reading bug 794248 comment 0, another alternative is the latest image has some corruption. (i.e. the ref image works, a newly imaged machine does not work) Are there any downsides to taking a fresh image of talos-r3-xp-ref?
I got this from the newsgroups: Armen Zambrano G. wrote: > Hi all, > Recently we've set up some new XP slaves that fail on mochitest jobs due to this error: > "ERROR: Logon failure: unknown user name or bad password." > > Unfortunately, I have not been able what is triggering that error (I can't find the strings in our code). > > If you're curious or have any ideas please let me know in bug 788382 Perhaps the file system permissions are incorrect and NT AUTHORITY\NETWORK SERVICE doesn't have permission to access C:\WINDOWS\system32\wbem\wmiprvse.exe ?
It seems these 2 articles touch on the topic but I don't know how check and change the permissions on Windows. http://blog.assarbad.net/20110126/dcom-error/ http://support.microsoft.com/kb/313222
Taking a step back. Not *all* re-imaged slaves have this symptom. These slaves are ill: * talos-r3-xp-001 * talos-r3-xp-063 * talos-r3-xp-079 * talos-r3-xp-081 * talos-r3-xp-082 * talos-r3-xp-084 * talos-r3-xp-085 * talos-r3-xp-086 * talos-r3-xp-088 In fact, I will put the healthy slaves back to the pool (talos-r3-xp-0{77,78,80,83,87,89,90}) except talos-r3-xp-076 for debugging purposes. Let's look talos-r3-xp-076 and experiment. Its OPSI sync time is of 2012-09-07 22:58:04 I don't see any differences in OPSI packages between an ill slave and a this healthy slave (within this same batch of reimaged slaves). Rebooting it and letting it go through OPSI does nothing negative (tasklist /svc still works). I will wait for talos-r3-xp-0[91-95] to be re-imaged and see what info fresh slaves can throw in.
Only 4 slaves are left to be fixed. They will be left for IT to do debugging on where the re-imaging process is misbehaving. This is fixed but we will have to keep an eye on XP slaves that get re-imaged.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Not depending on taskkill would be a long term project but here is some info about it. > On the other hand, do you think you could point me at the mozbase code > so I can determine what you guys do instead of > subprocess.Popen(["taskkill", "/F", "/PID", pid]).wait() [1] > Well, it's not something you can do in one line. We use IOCompletion ports to be certain that we can kill all the child processes spawned from the parent. Converting this over is going to be an undertaking, but it will be worth it in the end. Here is the code: https://github.com/mozilla/mozbase/tree/master/mozprocess Cheers, Clint
No longer depends on: 794248
(In reply to Armen Zambrano G. [:armenzg] from comment #26) > Only 4 slaves are left to be fixed. > They will be left for IT to do debugging on where the re-imaging process is > misbehaving. The re-image debug work is being done in bug 794248
Product: mozilla.org → Release Engineering
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: