Closed Bug 788382 Opened 13 years ago Closed 13 years ago

New WinXP slaves intermittently taking 7 hours to wind up disconnected

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Product:

Component:

Platform:

x86

Windows XP

Type:

task

Priority:

P2

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: philor, Assigned: armenzg)

References

Details

(Whiteboard: [capacity][buildduty])

Attachments

(1 file)

adjust list of slaves in OPSI 13 years ago Armen [:armenzg] 1.90 KB, patch	coop : review+ armenzg : checked-in+	Details \| Diff \| Splinter Review

Phil Ringnalda (:philor)

Reporter

Description

•

13 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=14960453&tree=Mozilla-Inbound talos-r3-xp-087 TEST-START | chrome://mochitests/content/browser/browser/components/tabview/test/browser_tabview_bug662266.js remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ========= Finished 'python mochitest/runtests.py ...' interrupted (results: 4, elapsed: 7 hrs, 9 mins, 7 secs) (at 2012-09-04 18:39:25.727862) ========= https://tbpl.mozilla.org/php/getParsedLog.php?id=14954485&tree=Mozilla-Inbound talos-r3-xp-090 TEST-INFO | chrome://mochitests/content/browser/dom/indexedDB/test/browser_quotaPromptDelete.js | dispatching event: indexedDB-addMore remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ========= Finished 'python mochitest/runtests.py ...' interrupted (results: 4, elapsed: 7 hrs, 36 mins, 54 secs) (at 2012-09-04 15:10:15.435786) ========= https://tbpl.mozilla.org/php/getParsedLog.php?id=14953423&tree=Mozilla-Inbound talos-r3-xp-083 TEST-INFO | C:\talos-slave\test\build\xpcshell\tests\toolkit\components\places\tests\unit\test_415757.js | running test ... remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ========= Finished 'bash -c ...' interrupted (results: 4, elapsed: 7 hrs, 7 mins, 13 secs) (at 2012-09-04 14:35:08.227435) =========

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Whiteboard: [capacity] → [capacity][buildduty]

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Assignee: nobody → armenzg

Armen [:armenzg]

Assignee

Comment 1

•

13 years ago

FTR the slaves were re-imaged on bug 786036.

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Depends on: 786036

Armen [:armenzg]

Assignee

Comment 2

•

13 years ago

It seems that the slaves were added to OPSI manually rather than through production-slaves file. This is why some OPSI packages are missing. Slaves 76 to 95 seem to only have these packages installed on OPSI: * twisted_dumbwin32proc * vc2005_debugcrt_x86 * vc2005sp1_redist * vc2010_debugcrt_x86 * winxp-graphic-driver I have deleted manually xp-084 from OPSI and added it to production-slaves. The cronjob has added the slave back based on the talos-r3-xp-ref entry/template in OPSI. This slave now shows the same OPSI packages as other production slaves. Comparing two slaves reveals the different packages (http://cl.ly/JJLJ): * cleanup * nagios_config * passwordupdate * preloginloader * python-2.7 * remove-psskill * time-server * toolsrepo * twisted_dumbwin32proc ** to my surprise in prod slaves this package is marked as "not_installed" but the actual ref image provides the replacement file rather than the OPSI package I will provide a patch for adjusting the slaves but I can't promise that all problems will be resolved. Some of these slaves are running on staging and I hope I can reproduce the issues. TODO: document this stuff properly I also moved the log because it was huge: production-opsi:~# cp new-slaves.log new-slaves.log.20120907 production-opsi:~# rm new-slaves.log && touch new-slaves.log [1] http://hg.mozilla.org/build/opsi-package-sources/file/default/look-for-new-slaves.py#l82

Armen [:armenzg]

Assignee

Comment 3

•

13 years ago

Attached patch adjust list of slaves in OPSI — Details — Splinter Review

Attachment #659338 - Flags: review?(coop)

Armen [:armenzg]

Assignee

Comment 4

•

13 years ago

Maybe this is the cause? https://bugzilla.mozilla.org/show_bug.cgi?id=548330 (In reply to Ben Hearsum [:bhearsum] from comment #0) > We used to 'pskill' to kill firefox.exe and other processes at the start of > a unittest run. With proper Twisted process management and reboots after > every build we no longer need it. Additionally, it often causes problems on > new slaves because it requires a license to be click-throughed. We should > stop running it in UnittestBuildFactory, and delete it from the slaves.

Armen [:armenzg]

Assignee

Comment 5

•

13 years ago

I disabled slaves 76 to 95 until we fix this.

Chris Cooper [:coop] (he/him)

Updated

•

13 years ago

Attachment #659338 - Flags: review?(coop) → review+

Updated

•

13 years ago

Blocks: 789580

Chris Cooper [:coop] (he/him)

Comment 7

•

13 years ago

(In reply to Armen Zambrano G. [:armenzg] from comment #2) > TODO: document this stuff properly Wow, no kidding. I consider myself reasonably well acquainted with OPSI, and I didn't even know production-slaves existed. Worse, it's not mentioned anywhere in our OPSI docs or even in the entire wiki. Also, why should it matter if the packages are marked as installed or not? Doesn't the base image for XP already contain all of the packages listed on the ref platform page? Given that our documentation for OPSI and ref platforms is at best out-of-date and at worst terrible, how are we to know what the real state of these machines is if we copy just copy the state of installed state of the ref platform packages? We should either use OPSI like puppet and actually install packages with it so we have a consistent, known state, or stop using OPSI all together and go strictly with reference images.

Armen [:armenzg]

Assignee

Comment 8

•

13 years ago

(In reply to Chris Cooper [:coop] from comment #7) > (In reply to Armen Zambrano G. [:armenzg] from comment #2) > > TODO: document this stuff properly > > Also, why should it matter if the packages are marked as installed or not? > It does not. > Doesn't the base image for XP already contain all of the packages listed on > the ref platform page? I don't know. It should. > Given that our documentation for OPSI and ref > platforms is at best out-of-date and at worst terrible, how are we to know > what the real state of these machines is if we copy just copy the state of > installed state of the ref platform packages? > I don't know. We can discuss it on Monday call. > > We should either use OPSI like puppet and actually install packages with it > so we have a consistent, known state, or stop using OPSI all together and go > strictly with reference images. > You now know to a deeper degree why we don't like OPSI.

Armen [:armenzg]

Assignee

Comment 9

•

13 years ago

I'm not sure if it was OPSI after all. I won't land the patch for now as I don't want to be distracted by it. I will get back to this unless someone else wants to pick it up. I need to deal with the metro nightly builds on elm and the win7 multinode evaluation. FTR I'm running on talos-r3-xp-084 manually and see what it yields: C:\mozilla-build\python25\python.exe mochitest/runtests.py --appname=firefox/firefox.exe --utility-path=bin --extra-profile-file=bin/plugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1346387570/firefox-18.0a1.en-US.win32.crashreporter-symbols.zip --total-chunks 5 --this-chunk 5 --chunk-by-dir 4 I'm also going to ask IT to reimage one of the slaves for me and set it up again.

Priority: -- → P3

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Depends on: 790082

Armen [:armenzg]

Assignee

Comment 10

•

13 years ago

Next thing to try is to setup talos-r3-xp-082 again (re-imaged from bug 790082), sync it with OPSI and run it through staging to see if it has the same issue.

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Blocks: talos-r3-xp-063

Armen [:armenzg]

Assignee

Comment 11

•

13 years ago

This seems to also affect older minis when they get re-imaged (see bug 751115).

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Priority: P3 → P2

Armen [:armenzg]

Assignee

Comment 12

•

13 years ago

Comment on attachment 659338 [details] [diff] [review] adjust list of slaves in OPSI http://hg.mozilla.org/build/opsi-package-sources/rev/e55c081cb8cf

Attachment #659338 - Flags: checked-in+

Armen [:armenzg]

Assignee

Comment 13

•

13 years ago

I have put talos-r3-xp-082 through staging after: 1) adjusting the hostname 2) removing the OPSI entry for it 3) rebooted it

Armen [:armenzg]

Assignee

Comment 14

•

13 years ago

No changes. I get the same: "ERROR: Logon failure: unknown user name or bad password." I will dig in the code: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py

Armen [:armenzg]

Assignee

Comment 15

•

13 years ago

To my surprise, I thought I had checked if _dumbwin32proc.py was the same as on other slaves. It seems I was wrong. I have marked the packaged to be installed and re-triggered a job. Let's hope is that, take the embarrassment and move on. I also noticed that on a successful job I see the following SUCCESS lines: SUCCESS: The process with PID 3116 has been terminated. SUCCESS: The process with PID 1900 has been terminated. SUCCESS: The process with PID 692 has been terminated. which come from these processes: INFO | automation.py | SSL tunnel pid: 3116 INFO | runtests.py | Server pid: 1900 INFO | runtests.py | Websocket server pid: 692

Armen [:armenzg]

Assignee

Comment 16

•

13 years ago

_dumbwin32proc.py does not seem to be involved. jmaher, jlebar, (I saw your names for code changes): I have modified the code to not run the automation (see below) and I still got the "ERROR: Logon failure: unknown user name or bad password." I have looked at the task manager and I can see the PIDs that get started: 1668 and 3264. I don't see the SSL tunel pid this time as I saw in comment 15. Probably because I commented a chunk of the code out. Any ideas? ######################### Code changes: self.automation.log.info("INFO | runtests.py | Running tests: start.\n") + ''' try: status = self.automation.runApp(testURL, browserEnv, options.app, options.profilePath, options.browserArgs, runSSLTunnel = self.runSSLTunnel, utilityPath = options.utilityPath, xrePath = options.xrePath, certPath=options.certPath, debuggerInfo=debuggerInfo, symbolsPath=options.symbolsPath, timeout = timeout) except KeyboardInterrupt: self.automation.log.info("INFO | runtests.py | Received keyboard interrupt .\n"); status = -1 except: self.automation.log.exception("INFO | runtests.py | Received unexpected ex ception while running application\n") status = 1 + ''' + status = 0 ######################### Here is the log I got: C:\talos-slave\test\build>C:\mozilla-build\python25\python.exe mochitest/runtest s.py --appname=firefox/firefox.exe --utility-path=bin --extra-profile-file=bin/p lugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ mozilla-central-win32/1346387570/firefox-18.0a1.en-US.win32.crashreporter-symbol s.zip --total-chunks 5 --this-chunk 2 --chunk-by-dir 4 INFO | runtests.py | Installing extension at C:\talos-slave\test\build\mochitest \extensions\specialpowers to c:\docume~1\cltbld\locals~1\temp\tmpkmypbj. INFO | runtests.py | Installing extension at C:\talos-slave\test\build\mochitest \extensions\worker to c:\docume~1\cltbld\locals~1\temp\tmpkmypbj. INFO | runtests.py | Installing extension at C:\talos-slave\test\build\mochitest \extensions\workerbootstrap to c:\docume~1\cltbld\locals~1\temp\tmpkmypbj. args: ['C:\\talos-slave\\test\\build\\bin\\xpcshell.exe', '-g', 'C:\\talos-slave \\test\\build\\firefox', '-v', '170', '-f', './httpd.js', '-e', "const _PROFILE_ PATH = 'c:\\\\docume~1\\\\cltbld\\\\locals~1\\\\temp\\\\tmpkmypbj';const _SERVER _PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined ;", '-f', './server.js'] INFO | runtests.py | Server pid: 1668 args: ['C:\\mozilla-build\\python25\\python.exe', 'C:\\talos-slave\\test\\build\ \mochitest\\pywebsocket_wrapper.py', '-p', '9988', '-w', 'C:\\talos-slave\\test\ \build\\mochitest', '-l', 'C:\\talos-slave\\test\\build\\mochitest\\websock.log' , '--log-level=debug', '--allow-handlers-outside-root-dir'] INFO | runtests.py | Websocket server pid: 3264 INFO | runtests.py | Running tests: start. ERROR: Logon failure: unknown user name or bad password. ERROR: Logon failure: unknown user name or bad password. WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks c an't be detected! INFO | runtests.py | Running tests: end.

Armen [:armenzg]

Assignee

Comment 17

•

13 years ago

The problem is right here: subprocess.Popen(["taskkill", "/F", "/PID", pid]).wait() I have tried also other ways of using taskkill manually: "taskill /S system /U cltbld /F /PID <number> but got: "ERROR: The RPC server is unavailable" http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#191 191 def kill(self): 192 if Automation().IS_WIN32: 193 import platform 194 pid = "%i" % self.pid 195 if platform.release() == "2000": 196 # Windows 2000 needs 'kill.exe' from the 197 #'Windows 2000 Resource Kit tools'. (See bug 475455.) 198 try: 199 subprocess.Popen(["kill", "-f", pid]).wait() 200 except: 201 self.log.info("TEST-UNEXPECTED-FAIL | automation.py | Missing 'kill' utility to kill process with pid=%s. Kill it manually!", pid) 202 else: 203 # Windows XP and later. 204 subprocess.Popen(["taskkill", "/F", "/PID", pid]).wait() 205 else: 206 os.kill(self.pid, signal.SIGKILL)

Armen [:armenzg]

Assignee

Comment 18

•

13 years ago

I can't run: tasklist /svc on any WinXP machine that gets re-imaged. I can do so in any older machine. I cannot restart the RPC services. I've searched for "taskkill rpc server is unavailable" but I have not yet found a solution that works. There is something very very fishy. MaRu, do you have any ideas?

Armen [:armenzg]

Assignee

Comment 19

•

13 years ago

Every time I try to run "tasklist /svc" I get a nasty error event: http://cl.ly/JiWi Event Type: Error Event Source: DCOM Event Category: None Event ID: 10000 Date: 9/25/2012 Time: 2:25:36 PM User: NT AUTHORITY\NETWORK SERVICE Computer: TALOS-R3-XP-082 Description: Unable to start a DCOM Server: {73E709EA-5D93-4B2E-BBB0-99B7938DA9E4}. The error: "Access is denied. " Happened while starting this command: C:\WINDOWS\system32\wbem\wmiprvse.exe -Embedding For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Amy Rich [:arr] [:arich]

Comment 20

•

13 years ago

Reading back through this bug, based on comment 11, are you saying that older machines that are reimaged are also having this problem? That points to some change that you guys made to the ref machine before the last capture you asked us to do. We can roll back to using the previous image (which also has the older password) if that helps.

Armen [:armenzg]

Assignee

Updated

•

13 years ago

Depends on: 794248

Armen [:armenzg]

Assignee

Comment 21

•

13 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #20) > Reading back through this bug, based on comment 11, are you saying that > older machines that are reimaged are also having this problem? That points > to some change that you guys made to the ref machine before the last capture > you asked us to do. We can roll back to using the previous image (which > also has the older password) if that helps. This is correct. It happens on older images. I have checked talos-r3-xp-ref and has no problems. Maybe we need a new snapshot? I have filed bug 794248 for now.

Comment 22

•

13 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #20) > Reading back through this bug, based on comment 11, are you saying that > older machines that are reimaged are also having this problem? That points > to some change that you guys made to the ref machine before the last capture > you asked us to do. We can roll back to using the previous image (which > also has the older password) if that helps. Actually, from reading bug 794248 comment 0, another alternative is the latest image has some corruption. (i.e. the ref image works, a newly imaged machine does not work) Are there any downsides to taking a fresh image of talos-r3-xp-ref?

Armen [:armenzg]

Assignee

Comment 23

•

13 years ago

I got this from the newsgroups: Armen Zambrano G. wrote: > Hi all, > Recently we've set up some new XP slaves that fail on mochitest jobs due to this error: > "ERROR: Logon failure: unknown user name or bad password." > > Unfortunately, I have not been able what is triggering that error (I can't find the strings in our code). > > If you're curious or have any ideas please let me know in bug 788382 Perhaps the file system permissions are incorrect and NT AUTHORITY\NETWORK SERVICE doesn't have permission to access C:\WINDOWS\system32\wbem\wmiprvse.exe ?

Armen [:armenzg]

Assignee

Comment 24

•

13 years ago

It seems these 2 articles touch on the topic but I don't know how check and change the permissions on Windows. http://blog.assarbad.net/20110126/dcom-error/ http://support.microsoft.com/kb/313222

Armen [:armenzg]

Assignee

Comment 25

•

13 years ago

Taking a step back. Not *all* re-imaged slaves have this symptom. These slaves are ill: * talos-r3-xp-001 * talos-r3-xp-063 * talos-r3-xp-079 * talos-r3-xp-081 * talos-r3-xp-082 * talos-r3-xp-084 * talos-r3-xp-085 * talos-r3-xp-086 * talos-r3-xp-088 In fact, I will put the healthy slaves back to the pool (talos-r3-xp-0{77,78,80,83,87,89,90}) except talos-r3-xp-076 for debugging purposes. Let's look talos-r3-xp-076 and experiment. Its OPSI sync time is of 2012-09-07 22:58:04 I don't see any differences in OPSI packages between an ill slave and a this healthy slave (within this same batch of reimaged slaves). Rebooting it and letting it go through OPSI does nothing negative (tasklist /svc still works). I will wait for talos-r3-xp-0[91-95] to be re-imaged and see what info fresh slaves can throw in.

Armen [:armenzg]

Assignee

Comment 26

•

13 years ago

Only 4 slaves are left to be fixed. They will be left for IT to do debugging on where the re-imaging process is misbehaving. This is fixed but we will have to keep an eye on XP slaves that get re-imaged.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Assignee

Comment 27

•

13 years ago

Not depending on taskkill would be a long term project but here is some info about it. > On the other hand, do you think you could point me at the mozbase code > so I can determine what you guys do instead of > subprocess.Popen(["taskkill", "/F", "/PID", pid]).wait() [1] > Well, it's not something you can do in one line. We use IOCompletion ports to be certain that we can kill all the child processes spawned from the parent. Converting this over is going to be an undertaking, but it will be worth it in the end. Here is the code: https://github.com/mozilla/mozbase/tree/master/mozprocess Cheers, Clint

Updated

•

13 years ago

No longer depends on: 794248

Comment 28

•

13 years ago

(In reply to Armen Zambrano G. [:armenzg] from comment #26) > Only 4 slaves are left to be fixed. > They will be left for IT to do debugging on where the re-imaging process is > misbehaving. The re-image debug work is being done in bug 794248

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: Platform Support → Buildduty

Product: Release Engineering → Infrastructure & Operations

Updated

•

6 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

You need to log in before you can comment on or make changes to this bug.