Closed Bug 972867 Opened 12 years ago Closed 10 years ago

Windows ix test machines don't always start buildbot

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bhearsum, Unassigned)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2054] )

Right now we have tons of windows test machines offline. Every one that I've looked at successfully ran a job, ran count_and_reboot.py, but didn't reboot (determined by comparing the time that count_and_reboot.py ran to the start time of the machine from "net statistics server"). For example: t-xp32-ix-116 has a start time of 2/13/2014 5:21am. It's second last job completed at 5:18am (http://buildbot-master74.srv.releng.usw2.mozilla.com:8201/builders/Windows%20XP%2032-bit%20mozilla-inbound%20talos%20chromez/builds/1310) It's last job started at 5:22am, and completed at 5:42am (http://buildbot-master74.srv.releng.usw2.mozilla.com:8201/builders/Windows%20XP%2032-bit%20try%20debug%20test%20reftest/builds/1527) count_and_reboot.py doesn't seem to output anything, so it's hard to determine what's happening here. twistd.log output is also unhelpful: 2014-02-13 05:18:56-0800 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x011B0878> 2014-02-13 05:18:56-0800 [Broker,client] startCommand:shell [id 895791] 2014-02-13 05:18:56-0800 [Broker,client] RunProcess._startCommand 2014-02-13 05:18:56-0800 [Broker,client] 'c:/mozilla-build/python27/python' '-u' 'scripts/external_tools/count_and_reboot.py' '-f' '../reboot_count.t xt' '-n' '1' '-z' 2014-02-13 05:18:56-0800 [Broker,client] in dir C:\\slave\test\. (timeout 1200 secs) 2014-02-13 05:18:56-0800 [Broker,client] watching logfiles {} 2014-02-13 05:18:56-0800 [Broker,client] argv: ['c:/mozilla-build/python27/python', '-u', 'scripts/external_tools/count_and_reboot.py', '-f', '../re boot_count.txt', '-n', '1', '-z'] 2014-02-13 05:18:56-0800 [Broker,client] environment: {'TMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'COMPUTERNAME': 'T-XP32-IX-116', 'MOZ_NO_ REMOTE': '1', 'USERDOMAIN': 'T-XP32-IX-116', 'KTS_VERSION': '1.19c', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 30 Stepping 5, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '1e05', 'SYSTEMROOT': 'C:\\WINDOWS', 'PAT H': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;C:\\mozilla-build\\msys\\bi n;C:\\mozilla-build\\python27;C:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\info-zip;C:\\mozilla-build\\wget;C:\\Program Files (x86)\\GnuWin 32\\bin;', 'NO_EM_RESTART': '1', 'XPCOM_DEBUG_BREAK': 'warn', 'TEMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', ' ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld.T-XP32-IX-116', 'US ERNAME': 'cltbld', 'LOGONSERVER': '\\\\T-XP32-IX-116', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'CYGWINBASE': 'C:\\cygwin', 'PAT HEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.PSC1', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIV E': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116\\Application Data', 'SYSTEMDRIVE': 'C:', 'KTS_HOME': 'C:\\Program Files\\KTS', 'NUMBER_OF_PROCESSORS': '8', 'PWD': 'C:\\slave\\test', 'PROCESSOR_LEVEL': '6', 'PROPERTIES_FILE': 'C:\\slave\\test/buildprops.json', 'MOZ_CRASHREPORTE R_NO_REPORT': '1', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116'} 2014-02-13 05:18:56-0800 [Broker,client] using PTY: False 2014-02-13 05:23:10-0800 [-] Log opened. 2014-02-13 05:23:10-0800 [-] twistd 10.2.0 (C:\mozilla-build\buildbotve\scripts\python.exe 2.7.3) starting up. 2014-02-13 05:23:10-0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor. <snip> 2014-02-13 05:41:01-0800 [Broker,client] RunProcess._startCommand 2014-02-13 05:41:01-0800 [Broker,client] 'c:/mozilla-build/python27/python' '-u' 'scripts/external_tools/count_and_reboot.py' '-f' '../reboot_count.t xt' '-n' '1' '-z' 2014-02-13 05:41:01-0800 [Broker,client] in dir C:\\slave\test\. (timeout 1200 secs) 2014-02-13 05:41:01-0800 [Broker,client] watching logfiles {} 2014-02-13 05:41:01-0800 [Broker,client] argv: ['c:/mozilla-build/python27/python', '-u', 'scripts/external_tools/count_and_reboot.py', '-f', '../re boot_count.txt', '-n', '1', '-z'] 2014-02-13 05:41:01-0800 [Broker,client] environment: {'TMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'MOZILLABUILD': 'D:\\mozilla-build', 'COM PUTERNAME': 'T-XP32-IX-116', 'MOZ_NO_REMOTE': '1', 'USERDOMAIN': 'T-XP32-IX-116', 'KTS_VERSION': '1.19c', 'LIBPATH': 'C:\\WINDOWS\\Microsoft.NET\\Fram ework\\v2.0.50727;D:\\msvs8\\VC\\ATLMFC\\LIB', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'MOZILLABUILDPATH': '\\mozilla-build\\', 'PROC ESSOR_IDENTIFIER': 'x86 Family 6 Model 30 Stepping 5, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '1e05', 'SYSTEMROOT': 'C:\\WINDOWS', 'PATH': 'C:\\Program Files\\NVIDIA Corporation\\PhysX\\Common;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS \\system32\\WindowsPowerShell\\v1.0;C:\\mozilla-build\\python27;C:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\msys\\bin;C:\\mozilla-build\\v im\\vim72;C:\\mozilla-build\\wget;C:\\mozilla-build\\info-zip;C:\\CoreUtils\\bin;C:\\mozilla-build\\buildbotve\\scripts;C:\\mozilla-build\\hg', 'NO_EM _RESTART': '1', 'MSVCDir': 'D:\\msvs8\\VC', 'XPCOM_DEBUG_BREAK': 'warn', 'TEMP': 'C:\\DOCUME~1\\CLTBLD~1.T-X\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE ': 'x86', 'VCVARS': 'D:\\msvs8\\VC\\bin\\vcvars32.bat', 'VSINSTALLDIR': 'D:\\msvs8', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'DevE nvDir': 'D:\\msvs8\\Common7\\IDE', 'MOZILLABUILDDRIVE': 'C:', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld.T-XP32-IX-116', 'FrameworkDir': 'C:\\WINDOWS\\Microsoft.NET\\Framework', 'MOZ_HIDE_RESULTS_TABLE': '1', 'FrameworkVersion': 'v2.0.50727', 'USERNAME': 'cltbld', 'LOGON SERVER': '\\\\T-XP32-IX-116', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'MOZ_TOOLS': 'D:\\mozilla-build\\moztools', 'MOZ_AIRBAG': '1', 'NO_FAIL_ON_TEST_ERRORS': '1', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.PSC1', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116\\Application Data', 'MOZ_MSVCVERSION': '8', 'SYSTEMDRIVE': 'C:', 'KTS_HOME': 'C:\\Program Files\\KTS', 'NUMBER_OF_PROCESSORS': '8', 'PWD': 'C:\\slave\\test', 'PROCESSOR_LEVEL': '6', 'PROPE RTIES_FILE': 'C:\\slave\\test/buildprops.json', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'VCINSTALLDIR': 'D:\\msvs8\\VC', 'OS': 'Windows_NT', 'FrameworkSDK Dir': 'D:\\msvs8\\SDK\\v2.0', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld.T-XP32-IX-116'} 2014-02-13 05:41:01-0800 [Broker,client] using PTY: False (end of file)
Assignee: nobody → armenzg
It seems that we're aborting reboots when graceful shutdown fails: 2014-02-12 16:28:06,380 - INFO - t-w732-ix-097 - Graceful shutdown failed, aborting reboot [1] The issue started on 2014-02-12. This started happening after the landings on bug 971737. If we look at slaveapi.log we will see this: 2014-02-12 14:44:28,436 - WARNING - 10.26.40.247 - First password as administrator didn't work. 2014-02-12 14:44:29,828 - INFO - 10.26.40.247 - Connection as root succeeded! ... 2014-02-12 16:26:25,832 - INFO - 10.26.40.247 - Connection as root succeeded! 2014-02-12 16:27:28,541 - WARNING - 10.26.40.247 - First password as administrator didn't work. This mean that we try for almost 2 hours and then give up. [1] [cltbld@buildbot-master65.srv.releng.usw2.mozilla.com slaverebooter]$ grep "w732-ix-097" slaverebooter.log | grep "Graceful" 2014-01-11 16:44:31,605 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-01-18 12:44:22,586 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-01-19 00:44:24,587 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-01-19 20:44:27,021 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-01-30 20:44:37,073 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-02-02 04:44:27,932 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-02-02 08:44:39,391 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-02-03 16:44:31,856 - INFO - t-w732-ix-097 - Graceful shutdown finished, rebooting 2014-02-12 16:28:06,380 - INFO - t-w732-ix-097 - Graceful shutdown failed, aborting reboot [2] 2014-02-12 14:44:23,369 - INFO - t-w732-ix-097 - Getting slavealloc info 2014-02-12 14:44:23,370 - INFO - Making request to: http://slavealloc.build.mozilla.org/api/slaves/t-w732-ix-097?byname=1 2014-02-12 14:44:23,405 - INFO - t-w732-ix-097 - Getting inventory info 2014-02-12 14:44:23,809 - INFO - t-w732-ix-097 - Getting bug info 2014-02-12 14:44:23,809 - INFO - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-w732-ix-097 2014-02-12 14:44:23,987 - INFO - t-w732-ix-097 - Getting recent job info 2014-02-12 14:44:24,103 - INFO - 10.132.49.112 - - [2014-02-12 14:44:24] "GET /slaves/t-w732-ix-097 HTTP/1.1" 200 809 0.738124 2014-02-12 14:44:24,215 - INFO - 10.132.49.112 - - [2014-02-12 14:44:24] "POST /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30 HTTP/1.1" 202 171 0.002082 2014-02-12 14:44:24,218 - INFO - t-w732-ix-097 - Getting slavealloc info 2014-02-12 14:44:24,219 - INFO - Making request to: http://slavealloc.build.mozilla.org/api/slaves/t-w732-ix-097?byname=1 2014-02-12 14:44:27,277 - INFO - t-w732-ix-097 - Starting graceful shutdown. 2014-02-12 14:44:54,354 - INFO - 10.132.49.112 - - [2014-02-12 14:44:54] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000738 2014-02-12 14:45:24,497 - INFO - 10.132.49.112 - - [2014-02-12 14:45:24] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000772 2014-02-12 14:45:54,638 - INFO - 10.132.49.112 - - [2014-02-12 14:45:54] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000780 2014-02-12 14:46:24,778 - INFO - 10.132.49.112 - - [2014-02-12 14:46:24] "GET /slaves/t-w732-ix-097/actions/shutdown_buildslave?waittime=30&requestid=68555728 HTTP/1.1" 200 139 0.000849
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1) > It seems that we're aborting reboots when graceful shutdown fails: > 2014-02-12 16:28:06,380 - INFO - t-w732-ix-097 - Graceful shutdown failed, > aborting reboot [1] "aborting reboot" here means "slaverebooter will not trigger a reboot through slaveapi". As far as I know, it should have no effect an the reboot at the end of the job.
bhearsum, should we had seen anything on the logs for t-w732-ix-097 from the 6th (when it stopped taking jobs) until the 12th? Was slaverebooter not working during that time? The question is not necessarily why it did not reboot but why it did not start buildbot? The machine did reboot. I don't know more about why the output was not clearer on twistd.log. I can see that the talosslave startup task run at 2014-02-06 4:46:08pm and the operation completed successfully at 4:47:01pm: > Task Scheduler successfully completed task "\talosslave" , instance > "{a5de2c77-ce18-4fb9-9bac-5219416762c9}" , action "C:\windows\SYSTEM32\cmd.exe" with return code 0. Not sure if any of the following is relevant or not: 2014-02-06 4:46:08PM - After the machine started up I saw this Error message ################ This computer was not able to set up a secure session with a domain controller in domain RELENG due to the following: There are currently no logon servers available to service the logon request. This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator. ADDITIONAL INFO If this computer is a domain controller for the specified domain, it sets up the secure session to the primary domain controller emulator in the specified domain. Otherwise, this computer sets up the secure session to any domain controller in the specified domain. ###############
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #3) > bhearsum, should we had seen anything on the logs for t-w732-ix-097 from the > 6th (when it stopped taking jobs) until the 12th? Was slaverebooter not > working during that time? It ran once on the 6th: 2014-02-06 16:44:17,354 - INFO - t-w732-ix-096 - Skipping reboot because last job ended recently at Thursday, February 06, 15:39 But was otherwise down because I had to make some fixes after https://bugzilla.mozilla.org/show_bug.cgi?id=931854.
Summary: windows ix test machines don't always reboot → Windows ix test machines don't always start buildbot
armen, are you still looking into this, do we have any ETA's etc? I see 15 no-job w7's and 21 XP's atm. Slaverebooter seems to be failing to shutdown the buildslave instance (as is a known bug there). But I'm suspecting there is some systemic issue here that could/should be triaged.
Flags: needinfo?(armenzg)
No, I don't currently have time for this. I added some logging in bug 973004 like this: C:\mozilla-build\python27\python c:\slave\runslave.py > C:\slave\runslave.log
Flags: needinfo?(armenzg)
dropping armen from assignee to properly reflect c#6 bug state
Assignee: armenzg → nobody
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2045]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2045] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2054]
No activity + runner should supervise buildbot.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.