Closed Bug 690232 Opened 13 years ago Closed 13 years ago

Windows slaves: SIGKILL failed to kill process

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 666019

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [windows][builldbot])

I thought this was WinXP-only, because that was the only place I'd seen it, for either weeks or months now, not sure which.

But https://tbpl.mozilla.org/php/getParsedLog.php?id=6592286&tree=Mozilla-Inbound is a Win7 Talos run which hung (because msys blows, and it got a permission denied error trying to clear the cache, that part's uninteresting), and then SIGKILL failed to kill process.

And https://tbpl.mozilla.org/php/getParsedLog.php?id=6590093&tree=Mozilla-Inbound is a Win64 nightly, which timed out in hg, again uninteresting and no surprise, and then SIGKILL failed to kill process.

For completeness, a random chunk of WinXP ones:

https://tbpl.mozilla.org/php/getParsedLog.php?id=6602739&tree=Mozilla-Inbound was a shutdown timeout

https://tbpl.mozilla.org/php/getParsedLog.php?id=6589136&tree=Mozilla-Inbound was a test timeout

https://tbpl.mozilla.org/php/getParsedLog.php?id=6597228&tree=Firefox was a shutdown timeout
Forgot to mention the severity-enhancer that made me actually file: https://tbpl.mozilla.org/php/getParsedLog.php?id=6602739&tree=Mozilla-Inbound was a mochitest-chrome shutdown timeout, which then ate our mochitest-browser-chrome, mochitest-a11y and mochitest-ipcplugins.
Oh sigh. I thought these were not an issue anymore.

I bet this [a] got lost with the newer version of buildbot/twisted 10.1 [b].

[a] https://wiki.mozilla.org/ReferencePlatforms/Test/WinXP#Twisted_patch_to_allow_buildbot_to_kill_jobs
[b] https://wiki.mozilla.org/ReferencePlatforms/Test/WinXP#Install_Buildbot

This is awful.

For a little more context:
* w7 tester [1] - OSError: [Errno 13] Permission denied: 'c:\\users\\cltbld\\appdata\\local\\temp\\tmpqo0enl\\profile\\Cache\\_CACHE_001_'
* w64 builders [2] - SIGKILL failed to kill process (after a time out)
* xp tester [3] - SIGKILL failed to kill process (after a time out)

[1]
Running test tp5: 
		Started Wed, 28 Sep 2011 02:19:05
	Screen width/height:1024/768
	colorDepth:24
	Browser inner width/height: 1006/586

NOISE: Cycle 1: loaded http://localhost/page_load_test/tp5/thesartorialist.blogspot.com/thesartorialist.blogspot.com/index.html (next: http://localhost/page_load_test/tp5/cakewrecks.blogspot.com/cakewrecks.blogspot.com/index.html)
Traceback (most recent call last):
  File "run_tests.py", line 540, in ?
    test_file(arg, screen, amo)
  File "run_tests.py", line 485, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "c:\talos-slave\talos-data\talos\ttest.py", line 397, in runTest
    self.cleanupProfile(temp_dir)
  File "c:\talos-slave\talos-data\talos\ttest.py", line 149, in cleanupProfile
    self._hostproc.removeDirectory(dir)
  File "c:\talos-slave\talos-data\talos\ffprocess_win32.py", line 203, in removeDirectory
    shutil.rmtree(dir)
  File "C:\Python24\lib\shutil.py", line 163, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "C:\Python24\lib\shutil.py", line 163, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "C:\Python24\lib\shutil.py", line 168, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "C:\Python24\lib\shutil.py", line 166, in rmtree
    os.remove(fullname)
OSError: [Errno 13] Permission denied: 'c:\\users\\cltbld\\appdata\\local\\temp\\tmpqo0enl\\profile\\Cache\\_CACHE_001_'

[2] 
Error pulling changes into e:\builds\moz2_slave\m-in-w64-ntly\build from http://hg.mozilla.org/integration/mozilla-inbound; clobbering
command: START
command: hg clone -r 95bbaf6cb2a6c9a4d3375da8381cb8db909ec4a0 http://hg.mozilla.org/integration/mozilla-inbound e:\\\\builds\\\\moz2_slave\\\\m-in-w64-ntly\\\\build
command: cwd: e:\builds\moz2_slave\m-in-w64-ntly
command: output:

command timed out: 3600 seconds without output, attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: exceptions.RuntimeError: SIGKILL failed to kill process
]

[3]
WARNING: 1 sort operation has occurred for the SQL statement '0x16315df8'.  See https://developer.mozilla.org/En/Storage/Warnings details.: file e:/builds/moz2_slave/m-in-w32-dbg/build/storage/src/mozStoragePrivateHelpers.cpp, line 144
TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output

command timed out: 1200 seconds without output, attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1
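
For anyone unfamiliar with the sequence in logs [2] and [3], here is a minimal Python sketch of the "time out, try to kill, give up" pattern. It is not buildbot's code: `run_with_timeout` and `KILL_GRACE_SECONDS` are hypothetical names, it assumes Python 3's `subprocess` timeouts, and it uses a total-runtime timeout rather than a no-output timeout to keep it short.

```python
import subprocess

KILL_GRACE_SECONDS = 5  # hypothetical: how long to wait for the kill to take effect


def run_with_timeout(cmd, timeout):
    """Run cmd and return its exit code; kill it if it exceeds timeout.

    Raises RuntimeError if the process is still alive after the kill,
    which is the situation the logs above report.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # TerminateProcess on Windows, SIGKILL on POSIX
        try:
            # A successful kill lets wait() return the real exit code.
            return proc.wait(timeout=KILL_GRACE_SECONDS)
        except subprocess.TimeoutExpired:
            # The process survived the kill; there is no real exit code to
            # report, which is why the logs show a fake rc of -1.
            raise RuntimeError("SIGKILL failed to kill process")
```

The point of the sketch is only that the kill step and the "did it actually die" check are separate, and the failure in this bug is in the second step.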
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #4)
> Oh sigh. I thought these were not an issue anymore.
> 
> I bet this [a] got lost with the newer version of buildbot/twisted 10.1 [b]
> 
> [a] https://wiki.mozilla.org/ReferencePlatforms/Test/WinXP#Twisted_patch_to_allow_buildbot_to_kill_jobs

I believe Dustin was concerned about this particular patch when he was rolling out the new buildbot version, but couldn't find anyone at the time who could give him details.

Armen: can you verify that the affected slaves in the logs that philor linked are, in fact, missing this twisted patch? Note: I'm not asking you to take the bug, just verify the cause.
OS: Windows 7 → All
Priority: -- → P3
Whiteboard: [windows][builldbot]
I believe this is the issue:
C:\Users\cltbld>C:\mozilla-build\wget\wget.exe http://hg.mozilla.org/build/opsi-package-sources/raw-file/520de951bbb0/twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py
C:\Users\cltbld>C:\mozilla-build\msys\bin\diff.exe _dumbwin32proc.py C:\mozilla-build\buildbotve\Lib\site-packages\twisted\internet\_dumbwin32proc.py
241c241,242
<             os.popen('taskkill /T /F /PID %s' % self.pid)
---
>             win32process.TerminateProcess(self.hProcess, 1)
>

We should deploy that version to all Windows build and test slaves.
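
To make the difference in the diff concrete: `TerminateProcess` ends only the one process whose handle it is given, while `taskkill /T /F` also ends that process's children. A rough Python sketch of the `taskkill` approach follows. The function names are hypothetical, and it uses `subprocess.call` (which waits for `taskkill` to finish) instead of the `os.popen` call in the patched file.

```python
import subprocess
import sys


def taskkill_command(pid):
    """Build the taskkill invocation used by the patched _dumbwin32proc.py.

    /T also ends child processes, /F forces termination, /PID selects the target.
    """
    return ["taskkill", "/T", "/F", "/PID", str(pid)]


def kill_process_tree(pid):
    """Kill pid and its descendants; returns taskkill's exit code (Windows only)."""
    if sys.platform != "win32":
        raise OSError("taskkill is only available on Windows")
    return subprocess.call(taskkill_command(pid))
```

Whether leftover child processes are what keeps the kill from succeeding on these slaves is an assumption on my part; the sketch only shows what the patched line does differently.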
That diff is monkeypatched in - see bug 666019

So it's possible that monkeypatch isn't working correctly, or that there's some other killing-processes-on-windows patch that used to be in place, but which nobody could remember well enough to point me to.  I would recommend starting your diagnostics there, rather than patching over the problem by hacking _dumbwin32proc.py.
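
One way to start those diagnostics is to have the monkeypatch tag the function it installs, so a slave can report whether the patch is actually in effect. A minimal sketch, assuming nothing about how the real monkeypatch is written: `apply_patch`, `is_patched`, the `_monkeypatched` marker, and `FakeProcess` are all hypothetical, and `FakeProcess` is a stand-in rather than Twisted's process class.

```python
def apply_patch(cls, method_name, replacement):
    """Replace cls.method_name and tag the replacement so it can be detected later."""
    replacement._monkeypatched = True  # hypothetical marker attribute
    setattr(cls, method_name, replacement)


def is_patched(cls, method_name):
    """Return True if cls.method_name carries the marker set by apply_patch."""
    method = getattr(cls, method_name, None)
    return getattr(method, "_monkeypatched", False)


class FakeProcess(object):
    """Stand-in for the real process class; not Twisted's implementation."""

    def signalProcess(self, signal_id):
        return "original"


def patched_signal_process(self, signal_id):
    """Hypothetical replacement that would issue the taskkill call."""
    return "patched"
```

Calling `is_patched(FakeProcess, "signalProcess")` before and after `apply_patch(FakeProcess, "signalProcess", patched_signal_process)` shows whether the replacement took; the same check against the real class on an affected slave would tell us if the monkeypatch ever ran.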
I just hit this on WinXP Debug TryServer.  I was expecting an orange result (from a crashtest that's expected to hang), but got purple on WinXP Debug instead, since we fail to kill the hanging process.
https://tbpl.mozilla.org/?tree=Try&rev=0762a4443dc1
https://tbpl.mozilla.org/php/getParsedLog.php?id=6842811&tree=Try
https://tbpl.mozilla.org/php/getParsedLog.php?id=6844930&tree=Try
I wasn't too worried about this, because I look at every single failed result no matter what the color. But it turns out that in general people just totally ignore purple, and also believe that all purple is the same, so if they push a Windows crash to try, they just assume try is broken when they get purple, and go ahead and push it for real.
Severity: normal → blocker
Summary: (Some?) Windows slaves: SIGKILL failed to kill process → Windows slaves: SIGKILL failed to kill process
I guess this isn't actually blocking development, just making it miserable.
Severity: blocker → critical
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> That diff is monkeypatched in - see bug 666019
> 
> So it's possible that monkeypatch isn't working correctly, or that there's
> some other killing-processes-on-windows patch that used to be in place, but
> which nobody could remember well enough to point me to.  I would recommend
> starting your diagnostics there, rather than patching over the problem by
> hacking _dumbwin32proc.py.

Actually I just looked because the *newly* rebuilt SeaMonkey slaves hit this.

And it looks like the patch from Bug 666019, despite the bug mentioning it was deployed to the slaves branch, was actually deployed to default, then merged to production-0.8, and never hit the slaves branch.

I suggest we either manually apply this patch to our slaves or deploy a buildbot 0.8.4-pre-moz3.
(In reply to Justin Wood (:Callek) from comment #14)
> (In reply to Dustin J. Mitchell [:dustin] from comment #7)
> And it looks like the patch from Bug 666019 despite mentioning it was
> deployed to the slaves branch, was actually deployed to default, then merged
> to production-0.8 and never hit the slaves branch.

Correction: it was never deployed to hg at all (I looked at the wrong monkeypatch).
Is there any chance this will ever be fixed, or should I patch tbpl to lie about the status of jobs, and show all purple as orange?
(In reply to Phil Ringnalda (:philor) from comment #16)
> Is there any chance this will ever be fixed, or should I patch tbpl to lie
> about the status of jobs, and show all purple as orange?

Just rediscovered today in triage, so...possibly?

Let's dupe to bug 666019 and get that deployed.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → Release Engineering