Closed
Bug 972867
Opened 12 years ago
Closed 10 years ago
Windows ix test machines don't always start buildbot
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: bhearsum, Unassigned)
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2054] )
Right now we have tons of windows test machines offline. Every one that I've looked at successfully ran a job, ran count_and_reboot.py, but didn't reboot (determined by comparing the time that count_and_reboot.py ran to the start time of the machine from "net statistics server"). For example:
t-xp32-ix-116 has a start time of 2/13/2014 5:21am.
It's second last job completed at 5:18am (http://buildbot-master74.srv.releng.usw2.mozilla.com:8201/builders/Windows%20XP%2032-bit%20mozilla-inbound%20talos%20chromez/builds/1310)
It's last job started at 5:22am, and completed at 5:42am (http://buildbot-master74.srv.releng.usw2.mozilla.com:8201/builders/Windows%20XP%2032-bit%20try%20debug%20test%20reftest/builds/1527)
count_and_reboot.py doesn't seem to output anything, so it's hard to determine what's happening here. twistd.log output is also unhelpful:
2014-02-13 05:18:56-0800 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x011B0878>
2014-02-13 05:18:56-0800 [Broker,client] startCommand:shell [id 895791]
2014-02-13 05:18:56-0800 [Broker,client] RunProcess._startCommand
2014-02-13 05:18:56-0800 [Broker,client] 'c:/mozilla-build/python27/python' '-u' 'scripts/external_tools/count_and_reboot.py' '-f' '../reboot_count.t
xt' '-n' '1' '-z'
2014-02-13 05:18:56-0800 [Broker,client] in dir C:\\slave\test\. (timeout 1200 secs)
2014-02-13 05:18:56-0800 [Broker,client] watching logfiles {}
2014-02-13 05:18:56-0800 [Broker,client] argv: ['c:/mozilla-build/python27/python', '-u', 'scripts/external_tools/count_and_reboot.py', '-f', '../re
boot_count.txt', '-n', '1', '-z']
2014-02-13 05:18:56-0800 [Broker,client] environment: {'TMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'COMPUTERNAME': 'T-XP32-IX-116', 'MOZ_NO_
REMOTE': '1', 'USERDOMAIN': 'T-XP32-IX-116', 'KTS_VERSION': '1.19c', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'PROCESSOR_IDENTIFIER':
'x86 Family 6 Model 30 Stepping 5, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '1e05', 'SYSTEMROOT': 'C:\\WINDOWS', 'PAT
H': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;C:\\mozilla-build\\msys\\bi
n;C:\\mozilla-build\\python27;C:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\info-zip;C:\\mozilla-build\\wget;C:\\Program Files (x86)\\GnuWin
32\\bin;', 'NO_EM_RESTART': '1', 'XPCOM_DEBUG_BREAK': 'warn', 'TEMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', '
ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld.T-XP32-IX-116', 'US
ERNAME': 'cltbld', 'LOGONSERVER': '\\\\T-XP32-IX-116', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'CYGWINBASE': 'C:\\cygwin', 'PAT
HEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.PSC1', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIV
E': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116\\Application Data', 'SYSTEMDRIVE': 'C:', 'KTS_HOME': 'C:\\Program Files\\KTS',
'NUMBER_OF_PROCESSORS': '8', 'PWD': 'C:\\slave\\test', 'PROCESSOR_LEVEL': '6', 'PROPERTIES_FILE': 'C:\\slave\\test/buildprops.json', 'MOZ_CRASHREPORTE
R_NO_REPORT': '1', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116'}
2014-02-13 05:18:56-0800 [Broker,client] using PTY: False
2014-02-13 05:23:10-0800 [-] Log opened.
2014-02-13 05:23:10-0800 [-] twistd 10.2.0 (C:\mozilla-build\buildbotve\scripts\python.exe 2.7.3) starting up.
2014-02-13 05:23:10-0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
<snip>
2014-02-13 05:41:01-0800 [Broker,client] RunProcess._startCommand
2014-02-13 05:41:01-0800 [Broker,client] 'c:/mozilla-build/python27/python' '-u' 'scripts/external_tools/count_and_reboot.py' '-f' '../reboot_count.t
xt' '-n' '1' '-z'
2014-02-13 05:41:01-0800 [Broker,client] in dir C:\\slave\test\. (timeout 1200 secs)
2014-02-13 05:41:01-0800 [Broker,client] watching logfiles {}
2014-02-13 05:41:01-0800 [Broker,client] argv: ['c:/mozilla-build/python27/python', '-u', 'scripts/external_tools/count_and_reboot.py', '-f', '../re
boot_count.txt', '-n', '1', '-z']
2014-02-13 05:41:01-0800 [Broker,client] environment: {'TMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'MOZILLABUILD': 'D:\\mozilla-build', 'COM
PUTERNAME': 'T-XP32-IX-116', 'MOZ_NO_REMOTE': '1', 'USERDOMAIN': 'T-XP32-IX-116', 'KTS_VERSION': '1.19c', 'LIBPATH': 'C:\\WINDOWS\\Microsoft.NET\\Fram
ework\\v2.0.50727;D:\\msvs8\\VC\\ATLMFC\\LIB', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'MOZILLABUILDPATH': '\\mozilla-build\\', 'PROC
ESSOR_IDENTIFIER': 'x86 Family 6 Model 30 Stepping 5, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '1e05', 'SYSTEMROOT':
'C:\\WINDOWS', 'PATH': 'C:\\Program Files\\NVIDIA Corporation\\PhysX\\Common;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS
\\system32\\WindowsPowerShell\\v1.0;C:\\mozilla-build\\python27;C:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\msys\\bin;C:\\mozilla-build\\v
im\\vim72;C:\\mozilla-build\\wget;C:\\mozilla-build\\info-zip;C:\\CoreUtils\\bin;C:\\mozilla-build\\buildbotve\\scripts;C:\\mozilla-build\\hg', 'NO_EM
_RESTART': '1', 'MSVCDir': 'D:\\msvs8\\VC', 'XPCOM_DEBUG_BREAK': 'warn', 'TEMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE
': 'x86', 'VCVARS': 'D:\\msvs8\\VC\\bin\\vcvars32.bat', 'VSINSTALLDIR': 'D:\\msvs8', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'DevE
nvDir': 'D:\\msvs8\\Common7\\IDE', 'MOZILLABUILDDRIVE': 'C:', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld.T-XP32-IX-116',
'FrameworkDir': 'C:\\WINDOWS\\Microsoft.NET\\Framework', 'MOZ_HIDE_RESULTS_TABLE': '1', 'FrameworkVersion': 'v2.0.50727', 'USERNAME': 'cltbld', 'LOGON
SERVER': '\\\\T-XP32-IX-116', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'MOZ_TOOLS': 'D:\\mozilla-build\\moztools', 'MOZ_AIRBAG':
'1', 'NO_FAIL_ON_TEST_ERRORS': '1', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.PSC1', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK':
'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116\\Application Data', 'MOZ_MSVCVERSION':
'8', 'SYSTEMDRIVE': 'C:', 'KTS_HOME': 'C:\\Program Files\\KTS', 'NUMBER_OF_PROCESSORS': '8', 'PWD': 'C:\\slave\\test', 'PROCESSOR_LEVEL': '6', 'PROPE
RTIES_FILE': 'C:\\slave\\test/buildprops.json', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'VCINSTALLDIR': 'D:\\msvs8\\VC', 'OS': 'Windows_NT', 'FrameworkSDK
Dir': 'D:\\msvs8\\SDK\\v2.0', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116'}
2014-02-13 05:41:01-0800 [Broker,client] using PTY: False
(end of file)
Updated•12 years ago
|
Assignee: nobody → armenzg
Comment 1•12 years ago
|
||
It seems that we're aborting reboots when graceful shutdown fails:
2014-02-12 16:28:06,380 - INFO - t-w732-ix-097 - Graceful shutdown failed, aborting reboot [1]
The issue started on 2014-02-12.
This started happening after the landings on bug 971737.
If we look at slaveapi.log we will see this:
2014-02-12 14:44:28,436 - WARNING - 10.26.40.247 - First password as administrator didn't work.
2014-02-12 14:44:29,828 - INFO - 10.26.40.247 - Connection as root succeeded!
...
2014-02-12 16:26:25,832 - INFO - 10.26.40.247 - Connection as root succeeded!
2014-02-12 16:27:28,541 - WARNING - 10.26.40.247 - First password as administrator didn't work.
This mean that we try for almost 2 hours and then give up.
[1]
[cltbld@buildbot-master65.srv.releng.usw2.mozilla.com slaverebooter]$ grep "w732-ix-097" slaverebooter.log | grep "Graceful"
2014-01-11 16:44:31,605 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-01-18 12:44:22,586 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-01-19 00:44:24,587 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-01-19 20:44:27,021 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-01-30 20:44:37,073 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-02-02 04:44:27,932 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-02-02 08:44:39,391 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-02-03 16:44:31,856 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting
2014-02-12 16:28:06,380 - INFO - t-w732-ix-097 - Graceful shutdown failed, aborting reboot
[2]
2014-02-12 14:44:23,369 - INFO - t-w732-ix-097 - Getting slavealloc info
2014-02-12 14:44:23,370 - INFO - Making request to: http://slavealloc.build.mozilla.org/api/slaves/t-w732-ix-097?byname=1
2014-02-12 14:44:23,405 - INFO - t-w732-ix-097 - Getting inventory info
2014-02-12 14:44:23,809 - INFO - t-w732-ix-097 - Getting bug info
2014-02-12 14:44:23,809 - INFO - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-w732-ix-097
2014-02-12 14:44:23,987 - INFO - t-w732-ix-097 - Getting recent job info
2014-02-12 14:44:24,103 - INFO - 10.132.49.112 - - [2014-02-12 14:44:24] "GET /slaves/t-w732-ix-097 HTTP/1.1" 200 809 0.738124
2014-02-12 14:44:24,215 - INFO - 10.132.49.112 - - [2014-02-12 14:44:24] "POST /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30 HTTP/1.1" 202 171 0.002082
2014-02-12 14:44:24,218 - INFO - t-w732-ix-097 - Getting slavealloc info
2014-02-12 14:44:24,219 - INFO - Making request to: http://slavealloc.build.mozilla.org/api/slaves/t-w732-ix-097?byname=1
2014-02-12 14:44:27,277 - INFO - t-w732-ix-097 - Starting graceful shutdown.
2014-02-12 14:44:54,354 - INFO - 10.132.49.112 - - [2014-02-12 14:44:54] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000738
2014-02-12 14:45:24,497 - INFO - 10.132.49.112 - - [2014-02-12 14:45:24] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000772
2014-02-12 14:45:54,638 - INFO - 10.132.49.112 - - [2014-02-12 14:45:54] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000780
2014-02-12 14:46:24,778 - INFO - 10.132.49.112 - - [2014-02-12 14:46:24] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000849
Reporter | ||
Comment 2•12 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> It seems that we're aborting reboots when graceful shutdown fails:
> 2014-02-12 16:28:06,380 - INFO - t-w732-ix-097 - Graceful shutdown failed,
> aborting reboot [1]
"aborting reboot" here means "slaverebooter will not trigger a reboot through slaveapi". As far as I know, it should have no effect an the reboot at the end of the job.
Comment 3•12 years ago
|
||
bhearsum, should we had seen anything on the logs for t-w732-ix-097 from the 6th (when it stopped taking jobs) until the 12th? Was slaverebooter not working during that time?
The question is not necessarily why it did not reboot but why it did not start buildbot?
The machine did reboot. I don't know more about why the output was not clearer on twistd.log.
I can see that the talosslave startup task run at 2014-02-06 4:46:08pm and the operation completed successfully at 4:47:01pm:
> Task Scheduler successfully completed task "\talosslave" , instance
> "{a5de2c77-ce18-4fb9-9bac-5219416762c9}" , action "C:\windows\SYSTEM32\cmd.exe" with return code 0.
Not sure if any of the following is relevant or not:
2014-02-06 4:46:08PM - After the machine started up I saw this Error message
################
This computer was not able to set up a secure session with a domain controller in domain RELENG due to the following:
There are currently no logon servers available to service the logon request.
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.
ADDITIONAL INFO
If this computer is a domain controller for the specified domain, it sets up the secure session to the primary domain controller emulator in the specified domain. Otherwise, this computer sets up the secure session to any domain controller in the specified domain.
###############
Reporter | ||
Comment 4•12 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #3)
> bhearsum, should we had seen anything on the logs for t-w732-ix-097 from the
> 6th (when it stopped taking jobs) until the 12th? Was slaverebooter not
> working during that time?
It ran once on the 6th:
2014-02-06 16:44:17,354 - INFO - t-w732-ix-096 - Skipping reboot because last job ended recently at Thursday, February 06, 15:39
But was otherwise down because I had to make some fixes after https://bugzilla.mozilla.org/show_bug.cgi?id=931854.
Updated•12 years ago
|
Summary: windows ix test machines don't always reboot → Windows ix test machines don't always start buildbot
Comment 5•11 years ago
|
||
armen, are you still looking into this, do we have any ETA's etc?
I see 15 no-job w7's and 21 XP's atm.
Slaverebooter seems to be failing to shutdown the buildslave instance (as is a known bug there). But I'm suspecting there is some systemic issue here that could/should be triaged.
Flags: needinfo?(armenzg)
Comment 6•11 years ago
|
||
No, I don't currently have time for this.
I added some logging in bug 973004 like this:
C:\mozilla-build\python27\python c:\slave\runslave.py > C:\slave\runslave.log
Flags: needinfo?(armenzg)
Comment 7•11 years ago
|
||
dropping armen from assignee to properly reflect c#6 bug state
Assignee: armenzg → nobody
Updated•11 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2045]
Updated•11 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2045] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2054]
Comment 8•10 years ago
|
||
No activity + runner should supervise buildbot.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
Assignee | ||
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•