XP jobs can run debug crashtests for days and then come back without rebooting properly

RESOLVED FIXED

Status

P1
normal
RESOLVED FIXED
6 years ago
3 months ago

People

(Reporter: armenzg, Assigned: armenzg)

Tracking

Details

Attachments

(3 attachments)

(Assignee)

Description

6 years ago
If we look at this job:
http://buildbot-master15.build.scl1.mozilla.com:8201/builders/Rev3%20WINNT%205.1%20mozilla-inbound%20debug%20test%20crashtest/builds/5414

We can see that philor interrupted through self-serve a job that had been running for 69 hours.

Before we deployed the newer _dumbwin32proc.py I assume he could not interrupt the jobs and eventually a releng person would reboot the machine back into production.

It seems like interrupting a job does not execute the reboot step.

I don't know what is the right way of fixing this.


14:58:21     INFO -  nsParser::ResumeParse(bool,bool,bool) [parser/htmlparser/src/nsParser.cpp:1524]
Output exceeded 52428800 bytes, remaining output has been truncated

command timed out: 7200 seconds elapsed, attempting to kill

command interrupted, attempting to kill

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
When did I tell self-serve to stop it, and when did it stop? I don't actually know the answer to either question, but my observations of the ones that are running, the times I remember trying to get them to stop, and the times (maybe 8 and 24 hours later) that only a few of them have stopped make it something I would instead describe as "something unknown causes crashtests on WinXP to run for days, ignoring entreaties to stop, until on their own they decide to stop without rebooting, or just finally lose their connection." After all, we kill hundreds of tests jobs on Try every day, and they do not then burn talos jobs because they still have Firefox running.
And if you have ideas for experimenting with them, we still have three than have been running since last Friday, one from Saturday and one from Sunday, and two more from today.
(Assignee)

Comment 3

6 years ago
(In reply to Phil Ringnalda (:philor) from comment #2)
> And if you have ideas for experimenting with them, we still have three than
> have been running since last Friday, one from Saturday and one from Sunday,
> and two more from today.

Thanks that will be great!

Maybe it shows 2 interruption attempts that perhaps did not go through.

I will investigate.
Assignee: nobody → armenzg
(Assignee)

Comment 4

6 years ago
Created attachment 729498 [details]
screenshot

I've seen this before on bug 853112.

It's on a specific reftest: 2278/2323

Any idea on how to identify it?
(Assignee)

Updated

6 years ago
Summary: XP jobs can run for days and then come back without rebooting properly → XP jobs can run debug crashtests for days and then come back without rebooting properly
(Assignee)

Comment 5

6 years ago
I will port info from bug 853112 forward:
(In reply to Armen Zambrano G. [:armenzg] from comment #0)
> For instance this:
> http://buildbot-master46.build.scl1.mozilla.com:8201/builders/
> Rev3%20WINNT%205.1%20mozilla-inbound%20debug%20test%20crashtest/builds/35
> 
> The SIGKILL was sent successfully. I was hoping that deploying
> _dumbwin32proc.py would help with situations like this.
> http://mxr.mozilla.org/build/source/buildbot/slave/buildslave/runprocess.
> py#723
> 
> I don't see anything on the Windows event viewer.
> 
> Any ideas on what I could look into?
> 
> If I kill the browser running the reftests will tell me that an unresponsive
> Firefox was kill by me and if I want to report it.
> 
> Doing these two manual steps did not cause for the job to progress.
> 
> Hitting "stop build" on buildbot made us go forward. [2]
> 
> ##################
> 
> On another slave with the same symptom I hit "stop build" since the
> beginning:
> http://buildbot-master48.build.scl1.mozilla.com:8201/builders/
> Rev3%20WINNT%205.1%20try%20debug%20test%20crashtest/builds/17
> 
> This approach does nothing either. [3]
> Killing the browser manually did not make us recover either.
> 
> This time I actually had to run "shutdown -f -r -t 0".
> 
> Any suggestions?
> 
> I assume briar patch would eventually reboot the machines but it would be
> interesting to figure out what is going on.
> 
> 
> [1]
> 
> 2013-03-19 19:12:29-0700 [Broker,client]   in dir c:\talos-slave\test\.
> (timeout 1200 secs) (maxTime 7200 secs)                                
> 2013-03-19 19:12:29-0700 [Broker,client]   watching logfiles {}             
> 
> 2013-03-19 19:12:29-0700 [Broker,client]   argv:
> ['c:/mozilla-build/python27/python', '-u',
> 'scripts/scripts/desktop_unittest.py', '--cfg', 'unitte
> sts/win_unittest.py', '--reftest-suite', 'crashtest', '--download-symbols',
> 'true']                                                                
> 2013-03-19 19:12:29-0700 [Broker,client]  environment: {'TMP':
> 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'MOZILLABUILD': 'D:\\mozilla-build',
> 'COMPUT
> ERNAME': 'TALOS-R3-XP-035', 'MOZ_NO_REMOTE': '1', 'USERDOMAIN':
> 'TALOS-R3-XP-035', 'LIBPATH':
> 'C:\\WINDOWS\\Microsoft.NET\\Framework\\v2.0.50727;D:
> \\msvs8\\VC\\ATLMFC\\LIB', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common
> Files', 'MOZILLABUILDPATH': '\\mozilla-build\\', 'PROCESSOR_IDENTIFIER'
> : 'x86 Family 6 Model 23 Stepping 10, GenuineIntel', 'PROGRAMFILES':
> 'C:\\Program Files', 'PROCESSOR_REVISION': '170a', 'SYSTEMROOT':
> 'C:\\WINDOWS'
> , 'PATH':
> 'C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;c:\\Program
> Files\\gnuwin32\\bin;c:\\Python24\\;c:\\Python24\\scripts;c:\\
> Program Files\\Vim\\vim72', 'NO_EM_RESTART': '1', 'MSVCDir':
> 'D:\\msvs8\\VC', 'XPCOM_DEBUG_BREAK': 'warn', 'TEMP':
> 'C:\\DOCUME~1\\cltbld\\LOCALS~1\
> \Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'VCVARS':
> 'D:\\msvs8\\VC\\bin\\vcvars32.bat', 'VSINSTALLDIR': 'D:\\msvs8',
> 'ALLUSERSPROFILE': 'C:\\Documen
> ts and Settings\\All Users', 'DevEnvDir': 'D:\\msvs8\\Common7\\IDE',
> 'MOZILLABUILDDRIVE': 'C:', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents
> and Settings\\cltbld', 'FrameworkDir':
> 'C:\\WINDOWS\\Microsoft.NET\\Framework', 'MOZ_HIDE_RESULTS_TABLE': '1',
> 'FrameworkVersion': 'v2.0.50727', 'U
> SERNAME': 'cltbld', 'LOGONSERVER': '\\\\TALOS-R3-XP-035', 'PROMPT': '$P$G',
> 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'MOZ_TOOLS': 'D:\\mozilla-
> build\\moztools', 'BOOTMODE': 'BKSTD', 'NO_FAIL_ON_TEST_ERRORS': '1',
> 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH', 'CLIENTNAME':
> 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE':
> 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld\\Application Data',
>  'MOZ_AIRBAG': '1', 'SYSTEMDRIVE': 'C:', 'MOZ_MSVCVERSION': '8',
> 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'c:\\talos-slave\\test',
> 'PROCESSOR_LEVEL': '6
> ', 'PROPERTIES_FILE': 'c:\\talos-slave\\test/buildprops.json',
> 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'VCINSTALLDIR': 'D:\\msvs8\\VC', 'OS':
> 'Windows_
> NT', 'FrameworkSDKDir': 'D:\\msvs8\\SDK\\v2.0', 'USERPROFILE':
> 'C:\\Documents and Settings\\cltbld'}                               
> 2013-03-19 19:12:29-0700 [Broker,client]   using PTY: False                 
> 
> 2013-03-19 21:12:29-0700 [-] command timed out: 7200 seconds elapsed,
> attempting to kill
> 2013-03-19 21:12:29-0700 [-] trying process.signalProcess('KILL')           
> 
> 2013-03-19 21:12:29-0700 [-]  signal KILL sent successfully
> 
> [2] after hitting "stop build"
> 2013-03-19 21:12:29-0700 [-]  signal KILL sent successfully                 
> 
> 2013-03-20 11:43:06-0700 [Broker,client] asked to interrupt current command:
> The web-page 'stop build' button was pressed by '&lt;unknown&gt;':
> 
> 2013-03-20 11:43:08-0700 [Broker,client] command interrupted, attempting to
> kill
> 2013-03-20 11:43:08-0700 [Broker,client] trying process.signalProcess('KILL')
> 2013-03-20 11:43:09-0700 [Broker,client] Process exited already - can't kill
> 2013-03-20 11:43:09-0700 [Broker,client] signalProcess/os.kill failed both
> times
> 2013-03-20 11:43:14-0700 [-] we tried to kill the process, and it wouldn't
> die.. finish anyway
> 2013-03-20 11:43:14-0700 [-] RunProcess.failed: command failed: SIGKILL
> failed to kill process
> 2013-03-20 11:43:15-0700 [-] SlaveBuilder.commandFailed
> <buildslave.commands.shell.SlaveShellCommand instance at 0x0186BC88>
> 2013-03-20 11:43:15-0700 [-] Unhandled Error
>         Traceback (most recent call last):
>         Failure: exceptions.RuntimeError: SIGKILL failed to kill process
> 2013-03-20 11:43:18-0700 [Broker,client]  startCommand:shell [id 161169]
> 
> [3] 
> 2013-03-19 22:59:16-0700 [-] command timed out: 7200 seconds elapsed,
> attempting to kill                                                  
> 2013-03-19 22:59:16-0700 [-] trying process.signalProcess('KILL')
> 2013-03-19 22:59:16-0700 [-]  signal KILL sent successfully
> 2013-03-20 11:46:05-0700 [Broker,client] asked to interrupt current command:
> The web-page 'stop build' button was pressed by '&lt;unknown&gt;':          
> 
> 2013-03-20 11:46:05-0700 [Broker,client] command interrupted, attempting to
> kill                                                                   
> 2013-03-20 11:46:05-0700 [Broker,client] trying process.signalProcess('KILL')
> 2013-03-20 11:46:05-0700 [Broker,client] Process exited already - can't kill
> 
> 2013-03-20 11:46:05-0700 [Broker,client] signalProcess/os.kill failed both
> times
(Assignee)

Updated

6 years ago
Duplicate of this bug: 853112
(Assignee)

Comment 7

6 years ago
I wonder if we should revert the _dumbwin32proc.py change and get back to our original state instead of debugging this.
(Assignee)

Comment 8

6 years ago
BTW this started happening intermittently on debug crashtests after we deployed a newer _dumbwin32proc.py

> This is what I did on a w7 and an xp slave:
> cd C:\mozilla-build\buildbotve\Lib\site-packages\twisted\internet
> del _dumbwin32proc.pyc
> move _dumbwin32proc.py _dumbwin32proc.py.bak
> wget -q --no-check-certificate
> https://hg.mozilla.org/build/opsi-package-sources/raw-file/e55c081cb8cf/
> twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py
> 
> I assume we need to do the same on win64 and win8 slaves.
> 
> #########################################
> diff -pU 8 _dumbwin32proc.py _dumbwin32proc.py.bak
> --- _dumbwin32proc.py   Wed Mar 13 10:41:13 2013
> +++ _dumbwin32proc.py.bak       Thu Sep 01 08:31:04 2011
> @@ -233,17 +233,18 @@ class Process(_pollingfile._PollingTimer
> 
>          self._addPollableResource(_Reaper(self))
> 
> 
>      def signalProcess(self, signalID):
>          if self.pid is None:
>              raise error.ProcessExitedAlready()
>          if signalID in ("INT", "TERM", "KILL"):
> -            os.popen('taskkill /T /F /PID %s' % self.pid)
> +            win32process.TerminateProcess(self.hProcess, 1)
> +
> 
>      def _getReason(self, status):
>          if status == 0:
>              return error.ProcessDone(status)
>          return error.ProcessTerminated(status)
> 
> 
>      def write(self, data):
for a current xp slave having issues: Running for 2 days, 16:51:28:

Tail of twistd.log on slave:

2013-03-23 14:24:23-0700 [Broker,client]  startCommand:shell [id 215286]
2013-03-23 14:24:23-0700 [Broker,client] RunProcess._startCommand
2013-03-23 14:24:23-0700 [Broker,client]  'c:\\mozilla-build\\hg\\hg' 'update' '-C' '-r' 'production'
2013-03-23 14:24:23-0700 [Broker,client]   in dir C:\\talos-slave\test\scripts (timeout 1200 secs)
2013-03-23 14:24:23-0700 [Broker,client]   watching logfiles {}
2013-03-23 14:24:23-0700 [Broker,client]   argv: ['c:\\mozilla-build\\hg\\hg', 'update', '-C', '-r', 'production']
2013-03-23 14:24:23-0700 [Broker,client]  environment: {'TMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'COMPUTERNAME': 'TALOS-R3-XP-099', 'USERDOMAIN': 'TALOS-R3-XP-099', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 23 Stepping 10, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '170a', 'SYSTEMROOT': 'C:\\WINDOWS', 'PATH': 'C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;c:\\Program Files\\gnuwin32\\bin;c:\\Python24\\;c:\\Python24\\scripts;c:\\Program Files\\Vim\\vim72', 'TEMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld', 'USERNAME': 'cltbld', 'LOGONSERVER': '\\\\TALOS-R3-XP-099', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'BOOTMODE': 'BKSTD', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld\\Application Data', 'SYSTEMDRIVE': 'C:', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'C:\\talos-slave\\test\\scripts', 'PROCESSOR_LEVEL': '6', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld'}
2013-03-23 14:24:23-0700 [Broker,client]   using PTY: False
2013-03-23 14:24:23-0700 [-] command finished with signal None, exit code 0, elapsedTime: 0.328000
2013-03-23 14:24:23-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x01826A58>
2013-03-23 14:24:23-0700 [Broker,client]  startCommand:shell [id 215287]
2013-03-23 14:24:23-0700 [Broker,client] RunProcess._startCommand
2013-03-23 14:24:23-0700 [Broker,client]  'c:\\mozilla-build\\hg\\hg' 'id' '-i'
2013-03-23 14:24:23-0700 [Broker,client]   in dir C:\\talos-slave\test\scripts (timeout 1200 secs)
2013-03-23 14:24:23-0700 [Broker,client]   watching logfiles {}
2013-03-23 14:24:23-0700 [Broker,client]   argv: ['c:\\mozilla-build\\hg\\hg', 'id', '-i']
2013-03-23 14:24:23-0700 [Broker,client]  environment: {'TMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'COMPUTERNAME': 'TALOS-R3-XP-099', 'USERDOMAIN': 'TALOS-R3-XP-099', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 23 Stepping 10, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '170a', 'SYSTEMROOT': 'C:\\WINDOWS', 'PATH': 'C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;c:\\Program Files\\gnuwin32\\bin;c:\\Python24\\;c:\\Python24\\scripts;c:\\Program Files\\Vim\\vim72', 'TEMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld', 'USERNAME': 'cltbld', 'LOGONSERVER': '\\\\TALOS-R3-XP-099', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'BOOTMODE': 'BKSTD', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld\\Application Data', 'SYSTEMDRIVE': 'C:', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'C:\\talos-slave\\test\\scripts', 'PROCESSOR_LEVEL': '6', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld'}
2013-03-23 14:24:23-0700 [Broker,client]   using PTY: False
2013-03-23 14:24:23-0700 [-] command finished with signal None, exit code 0, elapsedTime: 0.234000
2013-03-23 14:24:23-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x0180C1C0>
2013-03-23 14:24:23-0700 [Broker,client]  startCommand:shell [id 215288]
2013-03-23 14:24:23-0700 [Broker,client] RunProcess._startCommand
2013-03-23 14:24:23-0700 [Broker,client]  'c:/mozilla-build/python27/python' '-u' 'scripts/scripts/desktop_unittest.py' '--cfg' 'unittests/win_unittest.py' '--reftest-suite' 'crashtest' '--download-symbols' 'true'
2013-03-23 14:24:23-0700 [Broker,client]   in dir C:\\talos-slave\test\. (timeout 1200 secs) (maxTime 7200 secs)
2013-03-23 14:24:23-0700 [Broker,client]   watching logfiles {}
2013-03-23 14:24:23-0700 [Broker,client]   argv: ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/win_unittest.py', '--reftest-suite', 'crashtest', '--download-symbols', 'true']
2013-03-23 14:24:23-0700 [Broker,client]  environment: {'TMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'MOZILLABUILD': 'D:\\mozilla-build', 'COMPUTERNAME': 'TALOS-R3-XP-099', 'MOZ_NO_REMOTE': '1', 'USERDOMAIN': 'TALOS-R3-XP-099', 'LIBPATH': 'C:\\WINDOWS\\Microsoft.NET\\Framework\\v2.0.50727;D:\\msvs8\\VC\\ATLMFC\\LIB', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'MOZILLABUILDPATH': '\\mozilla-build\\', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 23 Stepping 10, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '170a', 'SYSTEMROOT': 'C:\\WINDOWS', 'PATH': 'C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;c:\\Program Files\\gnuwin32\\bin;c:\\Python24\\;c:\\Python24\\scripts;c:\\Program Files\\Vim\\vim72', 'NO_EM_RESTART': '1', 'MSVCDir': 'D:\\msvs8\\VC', 'XPCOM_DEBUG_BREAK': 'warn', 'TEMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'VCVARS': 'D:\\msvs8\\VC\\bin\\vcvars32.bat', 'VSINSTALLDIR': 'D:\\msvs8', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'DevEnvDir': 'D:\\msvs8\\Common7\\IDE', 'MOZILLABUILDDRIVE': 'C:', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld', 'FrameworkDir': 'C:\\WINDOWS\\Microsoft.NET\\Framework', 'MOZ_HIDE_RESULTS_TABLE': '1', 'FrameworkVersion': 'v2.0.50727', 'USERNAME': 'cltbld', 'LOGONSERVER': '\\\\TALOS-R3-XP-099', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'MOZ_TOOLS': 'D:\\mozilla-build\\moztools', 'BOOTMODE': 'BKSTD', 'NO_FAIL_ON_TEST_ERRORS': '1', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld\\Application Data', 'MOZ_AIRBAG': '1', 'SYSTEMDRIVE': 'C:', 'MOZ_MSVCVERSION': '8', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'C:\\talos-slave\\test', 'PROCESSOR_LEVEL': '6', 'PROPERTIES_FILE': 'C:\\talos-slave\\test/buildprops.json', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'VCINSTALLDIR': 'D:\\msvs8\\VC', 'OS': 'Windows_NT', 'FrameworkSDKDir': 'D:\\msvs8\\SDK\\v2.0', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld'}
2013-03-23 14:24:23-0700 [Broker,client]   using PTY: False
2013-03-23 16:24:23-0700 [-] command timed out: 7200 seconds elapsed, attempting to kill
2013-03-23 16:24:23-0700 [-] trying process.signalProcess('KILL')
2013-03-23 16:24:23-0700 [-]  signal KILL sent successfully
------------------
Tail of current job log (from buildbot master):

14:43:21     INFO -  XREMain::XRE_main(int,char * * const,nsXREAppData const *) [toolkit/xre/nsAppRunner.cpp:3947]

Output exceeded 52428800 bytes, remaining output has been truncated

command timed out: 7200 seconds elapsed, attempting to kill
------------------

Which reached the output exceeded due to repeatedly hitting a single assertion:

14:41:07     INFO -  REFTEST TEST-START | file:///C:/talos-slave/test/build/tests/reftest/tests/parser/htmlparser/tests/crashtests/460706-1.xhtml | 2276 / 2321 (98%)
14:41:07     INFO -  ++DOMWINDOW == 96 (22F438A8) [serial = 5128] [outer = 09370BE0]
14:41:07     INFO -  WARNING: No principal to execute JS with: file e:/builds/moz2_slave/try-w32-d-00000000000000000000/build/dom/src/jsurl/nsJSProtocolHandler.cpp, line 159
14:41:07     INFO -  ###!!! ASSERTION: ?: '(aBuffer && aLength != 0) || (!aBuffer && aLength == 0)', file e:/builds/moz2_slave/try-w32-d-00000000000000000000/build/parser/htmlparser/src/nsExpatDriver.cpp, line 976
14:41:07     INFO -  nsExpatDriver::ConsumeToken(nsScanner &,bool &) [parser/htmlparser/src/nsExpatDriver.cpp:1095]
14:41:07     INFO -  nsParser::Tokenize(bool) [parser/htmlparser/src/nsParser.cpp:2034]
14:41:07     INFO -  nsParser::ResumeParse(bool,bool,bool) [parser/htmlparser/src/nsParser.cpp:1524]
14:41:07     INFO -  nsParser::ContinueInterruptedParsing() [parser/htmlparser/src/nsParser.cpp:1116]
14:41:07     INFO -  nsXMLContentSink::ContinueInterruptedParsingIfEnabled() [content/xml/document/src/nsXMLContentSink.cpp:1646]
14:41:07     INFO -  nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:627]
14:41:07     INFO -  NS_ProcessNextEvent_P(nsIThread *,bool) [obj-firefox/xpcom/build/nsThreadUtils.cpp:238]
14:41:07     INFO -  mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:82]
14:41:07     INFO -  MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:216]
14:41:07     INFO -  MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:210]
14:41:07     INFO -  MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:184]
14:41:07     INFO -  nsBaseAppShell::Run() [widget/xpwidgets/nsBaseAppShell.cpp:165]
14:41:07     INFO -  nsAppShell::Run() [widget/windows/nsAppShell.cpp:161]
14:41:07     INFO -  nsAppStartup::Run() [toolkit/components/startup/nsAppStartup.cpp:289]
14:41:07     INFO -  XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:3880]
14:41:07     INFO -  XREMain::XRE_main(int,char * * const,nsXREAppData const *) [toolkit/xre/nsAppRunner.cpp:3947]
14:41:07     INFO -  XRE_main [toolkit/xre/nsAppRunner.cpp:4161]
14:41:07     INFO -  do_main [browser/app/nsBrowserApp.cpp:228]
14:41:07     INFO -  NS_internal_main(int,char * *) [browser/app/nsBrowserApp.cpp:529]
14:41:07     INFO -  wmain [toolkit/xre/nsWindowsWMain.cpp:112]
14:41:07     INFO -  __tmainCRTStartup [f:/dd/vctools/crt_bld/self_x86/crt/src/crtexe.c:552]
14:41:07     INFO -  wmainCRTStartup [f:/dd/vctools/crt_bld/self_x86/crt/src/crtexe.c:371]
14:41:07     INFO -  kernel32 + 0x16d4f

----------------

Of extra notes:

50% cpu (on a dual-core system) is being used with |C:\mozilla-build\buildbotve\scripts\python.exe C:\mozilla-build\buildbotve\scripts\twistd.py --no_save --logfile C:\\talos-slave\twistd.log --python C:\\talos-slave\buildbot.tac|

with no subprocs on that

25% cpu on |C:\talos-slave\test\build\venv\Scripts\python -u C:\talos-slave\test\build\tests\reftest/runreftest.py --appname=C:\talos-slave\test\build\application\firefox\firefox.exe --utility-path=tests/bin --extra-profile-file=tests/bin/plugins --symbols-path=C:\talos-slave\test\build\symbols tests/reftest/tests/testing/crashtest/crashtests.list|

and then its direct parent proc had 

25 % on that last one's parent |c:/mozilla-build/python27/python -u scripts/scripts/desktop_unittest.py --cfg unittests/win_unittest.py --reftest-suite crashtest --download-symbols true|

routinely 1% cpu on the Firefox proc, which has a window, but nothing being painted. (such as if the threads never returned to WM_MAIN)

;;

Force killed the firefox (and plugin-container) procs myself made the buildslave python proc < 1 % cpu, and the runreftest.py proc naturally died, desktop_unittest.py now running at ~1.5% cpu

--- perhaps before backing this out we should add a twistd log entry for what pid we're killing  with |> -            os.popen('taskkill /T /F /PID %s' % self.pid)|
(Assignee)

Updated

6 years ago
Priority: -- → P1
(Assignee)

Comment 10

6 years ago
Created attachment 729575 [details] [diff] [review]
update _dumbwin32proc.py for real

Unless we bump the version nothing really happens.

In other words, _dumbwin32proc.py has *not* been deployed to any of our XP machines and this bug is not a fallout of the deployment.

I've also added a correct version to back this out.
Attachment #729575 - Flags: review?(coop)
(Assignee)

Comment 11

6 years ago
Created attachment 729581 [details] [diff] [review]
log which pid we kill
Attachment #729581 - Flags: review?(bugspam.Callek)
Comment on attachment 729581 [details] [diff] [review]
log which pid we kill

iirc you need to bump the ver in the OPSI control file, but you're the opsi expert so I'll let you decide the validity of that
Attachment #729581 - Flags: review?(bugspam.Callek) → review+
(Assignee)

Comment 13

6 years ago
(In reply to Justin Wood (:Callek) from comment #12)
> Comment on attachment 729581 [details] [diff] [review]
> log which pid we kill
> 
> iirc you need to bump the ver in the OPSI control file, but you're the opsi
> expert so I'll let you decide the validity of that

I was going to take this change with the one that I asked coop for review. That way we bump the version for both of them.
Comment on attachment 729575 [details] [diff] [review]
update _dumbwin32proc.py for real

Review of attachment 729575 [details] [diff] [review]:
-----------------------------------------------------------------

Good ol' OPSI.
Attachment #729575 - Flags: review?(coop) → review+

Comment 15

6 years ago
Ideally when we update this harness to use mozbase's mozprocess we will have proper windows process handling and that should fix issues like this. (It manages nested process groups for windows much more reliably than the existing code in automation.py does).
(Assignee)

Updated

6 years ago
Attachment #729575 - Flags: checked-in+
(Assignee)

Updated

6 years ago
Attachment #729581 - Flags: checked-in+
(Assignee)

Comment 16

6 years ago
I landed it, deployed it, have seen a slave's version go to 1.1 and have seen the file deployed.
I've marked all XP slaves to receive the file.
I've marked talos-r3-xp-ref and win32-ix-ref to receive the file and rebooted them.

I will have an eye on the XP slaves all day.
(Assignee)

Comment 17

6 years ago
We should also file a dev bug.
There were more than a dozen of XP machines on the same stage.
Let's hope this deployment alleviates the problem.
(Assignee)

Comment 18

6 years ago
It seems to have fixed the issue.
Let's re-open if we see it again.

79 WINNT 5.1 jobs running and none of them over 2 hours.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering

Updated

3 months ago
Product: Release Engineering → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.