Closed
Bug 645153
Opened 13 years ago
Closed 13 years ago
XP debug jobs are sometimes hanging
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: armenzg)
Attachments (2 files, 3 obsolete files)
12.16 KB, text/plain
14.62 KB, patch | dustin: review+ | armenzg: checked-in+
I have seen a lot of xpcshell jobs hang, and I have seen another 3-4 test suites fail the same way. Most of them don't necessarily show the "send error report" dialog; they just stop running the tests without giving any reason. No message. Nothing. Buildbot then tries to kill the job but it can't, because the slave doesn't have the twisted patch, and we end up with the ProcessExitedAlready issue (see bug 626486).

I can reproduce the problem manually by running this on a slave that I know was having the issue:

bash -c 'if [ ! -d firefox/plugins ]; then mkdir firefox/plugins; fi && cp bin/xpcshell.exe firefox && cp -R bin/components/* firefox/components/ && cp -R bin/plugins/* firefox/plugins/ && python -u xpcshell/runxpcshelltests.py --symbols-path=symbols --manifest=xpcshell/tests/all-test-dirs.list firefox/xpcshell.exe'

For instance, it stopped at this test:
> TEST-INFO | c:\talos-slave\test\build\xpcshell\tests\netwerk\test\unit\test_dns_service.js | running test ...

I will look at deploying the twisted patch (attachment 499149) to see if it fixes the issue. Where can I look for a log or something?

For now, the way to get a slave out of that state is to reboot it with "shutdown -f -r -t 0".
Comment 1 • 13 years ago • Assignee
An example of debug jsreftest (no pop up):

++DOMWINDOW == 5404 (342D4B48) [serial = 5405] [outer = 06A4E0C8]
REFTEST TEST-START | file:///C:/talos-slave/test/build/jsreftest/tests/jsreftest.html?test=js1_8/regress/regress-464418.js
++DOMWINDOW == 5405 (34242928) [serial = 5406] [outer = 06A4E0C8]
BUGNUMBER: 464418
STATUS: Do not assert: fp->slots + fp->script->nfixed + js_ReconstructStackDepth(cx, fp->script, fp->regs->pc) == fp->regs->sp
TEST-UNEXPECTED-FAIL | file:///C:/talos-slave/test/build/jsreftest/tests/jsreftest.html?test=js1_8/regress/regress-464418.js | application timed out after 330 seconds with no output

command timed out: 1200 seconds without output
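
The two "timed out ... with no output" messages above come from watchdogs that kill a job once it has been silent for too long. The following is only a rough sketch of that idea in modern Python 3, not the code used by the test harness or by buildbot; the function name run_with_output_timeout and its return shape are made up for illustration.

```python
import subprocess
import threading
import time


def run_with_output_timeout(argv, timeout):
    """Run argv and kill it if it prints nothing for `timeout` seconds.

    Hypothetical helper. Returns (exit_code, output_lines, timed_out).
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = []
    last_output = [time.monotonic()]  # one-item list so the thread can update it

    def reader():
        # Every line of output resets the silence timer.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > timeout:
            timed_out = True
            try:
                proc.kill()
            except OSError:
                # The process may already have exited; nothing left to kill.
                pass
            break
        time.sleep(0.05)

    proc.wait()
    thread.join(timeout=1)
    return proc.returncode, lines, timed_out
```

The except branch is there because a job can exit between the moment the watchdog decides to kill it and the moment the kill is attempted, which is the situation a killer has to tolerate rather than treat as fatal.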
Comment 2 • 13 years ago • Assignee
Debug reftest is running extremely slowly on talos-r3-xp-053 (I am rebooting it). All other debug reftest jobs seem to have taken less than 50 minutes to complete, while this job had already been running for over 2 hours and had run only 1300 of more than 5000 tests.
Comment 3 • 13 years ago • Assignee
talos-r3-xp-029: debug xpcshell with a "send report" prompt and ProcessExitedAlready.
talos-r3-xp-022: *opt* xpcshell job. No prompt. ProcessExitedAlready.
talos-r3-xp-012: debug xpcshell. No prompt. ProcessExitedAlready.
Comment 4 • 13 years ago • Assignee
Could someone tell me what this test is doing?
http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/test_dns_service.js

How can I run that test individually?
Comment 5 • 13 years ago • Assignee
Assignee: nobody → armenzg
Comment 6 • 13 years ago • Assignee
I have added two XP slaves on staging (xp-002 and xp-003). The second one has the twisted patches:

wget -Osubproc.patch --no-check-certificate https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=499149
cd C:\mozilla-build\python25\Lib\site-packages
C:\mozilla-build\msys\bin\patch -p1 < "C:\Documents and Settings\cltbld\subproc.patch"

I will review how things are going after the weekend. Meanwhile, I will check several times a day for stranded XP slaves to reboot. If things get very bad, all we have to do is disable XP debug unit tests by changing this to an empty list:
http://hg.mozilla.org/build/buildbot-configs/file/tip/mozilla-tests/config.py#l264
Comment 8 • 13 years ago
(In reply to comment #4)
> Could someone tell me what this test is doing?
> http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/test_dns_service.js
>
> How can I run that test individually?

SOLO_FILE=test_dns_service.js make -C ff-objdir/netwerk/test/ check-one
Comment 9 • 13 years ago • Assignee
The patch deploys the new file to C:\mozilla-build\python25\Lib\site-packages\twisted\internet\_dumbwin32proc.py. The package also allows us to revert to the original file.
Attachment #522425 -
Flags: review?(dustin)
Comment 10 • 13 years ago
Comment on attachment 522425
[opsi] replace _dumbwin32proc.py with version that allows buildbot to kill jobs

Looks good. My only concern is ordering: if this gets "installed" before the Twisted package does, its changes will be overwritten. Is there a way to control that with OPSI?
Attachment #522425 -
Flags: review?(dustin) → review+
Comment 11 • 13 years ago • Assignee
We don't deploy Twisted through OPSI on XP slaves so we don't have to worry about it :)
Comment 12 • 13 years ago • Assignee
I thought the first patch worked, but it was creating a _dumbwin32proc.py directory instead of copying the file. I have modified the patch, adding double quotes and "new" and "original" directories, but it is not working on staging. I guess I will have to keep on rebooting slaves until tomorrow.

...
scriptname: "twisted_dumbwin32proc.ins", special path: "P:\install\twisted_dumbwin32proc\"
...
============ Version 4.8.8.1 WIN32 script "P:\install\twisted_dumbwin32proc\twisted_dumbwin32proc.ins"
start: 2011-03-28 13:23:29 (on client named as : "talos-r3-xp-003.build.mozilla.org")
[executing: "C:\Program Files\opsi.org\preloginloader\opsi-winst\winst32.exe"]
system infos:
D4:9A:20:BC:E1:E0 - PC hardware address
talos-r3-xp-003.build.mozilla.org - IP name
10.12.50.111 - IP address
ENU - System default locale
Execution of Files_twisted
Error: Directory P:\install\twisted_dumbwin32proc\new\_dumbwin32proc.py does not exist and cannot be created
___________________
1 error
0 warnings
no script found for file name ""
Comment 13 • 13 years ago • Assignee
nthomas, rail, bhearsum: do you see anything that I am doing obviously wrong in the patch?
Comment 14 • 13 years ago
Nothing jumps out at me. Maybe double-check the syntax for winst's copy utility? And you'll have to be super careful about the slave state, as well as building and registering the opsi package.
Comment 15 • 13 years ago • Assignee
Finally! I was using the wrong winst syntax. DosInAnIcon is the way to go. Files treats the path as a directory and is generally used with directory\*.* patterns. This worked both to install the package and to uninstall it. I will deploy this after lunch.
Attachment #522425 -
Attachment is obsolete: true
Attachment #522483 -
Attachment is obsolete: true
Attachment #522704 -
Flags: review?(dustin)
Comment 16 • 13 years ago
Comment on attachment 522704
[opsi] replace _dumbwin32proc.py with version that allows buildbot to kill jobs

I'm still worried about the ordering here. We could end up with some systems actually having this patch and others not having it.

Also, the install should probably make a backup copy of what's already installed. The strategy of uninstalling by installing a different file may lead to confusion and unhappiness. You could also just omit the uninstall step entirely.
Attachment #522704 -
Flags: review?(dustin) → review-
Comment 17 • 13 years ago • Assignee
This addresses what you asked for and has been tested on staging for both the install and uninstall scenarios. As per IRC, ordering does not matter because Twisted comes from the ref machine and not from OPSI.
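
The backup-then-replace approach requested in the review can be sketched roughly as follows. This is not the OPSI/winst script that was attached to this bug; it is a hypothetical Python illustration, and the function names and the ".orig" suffix are assumptions.

```python
import os
import shutil


def install_with_backup(new_file, target, backup_suffix=".orig"):
    """Copy new_file over target, saving the existing target first.

    Hypothetical helper. Returns the path of the backup file.
    """
    backup = target + backup_suffix
    # Keep only the first backup, so that re-running the install does not
    # replace the pristine copy with an already patched file.
    if os.path.isfile(target) and not os.path.exists(backup):
        shutil.copy2(target, backup)
    # File-to-file copy; target is a file path, not a directory.
    shutil.copy2(new_file, target)
    return backup


def uninstall_from_backup(target, backup_suffix=".orig"):
    """Restore target from its backup. Returns True if a backup was restored."""
    backup = target + backup_suffix
    if not os.path.isfile(backup):
        return False
    os.replace(backup, target)
    return True
```

Uninstalling restores whatever was on the machine before the install, rather than installing a second bundled copy of the "original" file.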
Attachment #522704 -
Attachment is obsolete: true
Attachment #522761 -
Flags: review?(dustin)
Updated • 13 years ago
Attachment #522761 -
Flags: review?(dustin) → review+
Comment 18 • 13 years ago • Assignee
Comment on attachment 522761
[opsi] replace _dumbwin32proc.py with version that allows buildbot to kill jobs (take 2)

This landed with:
http://hg.mozilla.org/build/opsi-package-sources/rev/520de951bbb0

All XP slaves have been marked for this to get deployed. I hope this is enough to fix the issue.

The maintenance page and the Reference page have been updated:
https://wiki.mozilla.org/ReferencePlatforms/Test/WinXP#Twisted_patch_to_allow_buildbot_to_kill_jobs

There is nothing to be done with the ref machine AFAIU. It was believed that this patch already existed on the XP slaves since it was part of the rev2 machines. The change could have been lost in a mozilla-build re-install or in the python update (from 2.4 to 2.5). We don't really know. For historical purposes, the related bugs are bug 420216 and bug 537751.
Attachment #522761 -
Flags: checked-in+
Comment 19 • 13 years ago • Assignee
I have reverted this change on the XP slaves because it affects the reboot step for talos jobs. We had to RDP to each slave with a talos job stuck on the reboot step, since a reboot had already been initiated but was stuck on a prompt. Everything is back to normal. I will continue to reboot the usual XP debug jobs until tomorrow.

On unit tests:
http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#453
VS direct call of count_and_reboot.py:
http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#7600

Talos output:
2011-03-29 12:16:41-0800 [Broker,client] in dir C:\talos-slave\test\../talos-data (timeout 1200 secs)
2011-03-29 12:16:41-0800 [Broker,client] watching logfiles {}
2011-03-29 12:16:41-0800 [Broker,client] argv: ['python', 'count_and_reboot.py', '-f', '../talos_count.txt', '-n', '1', '-z']
2011-03-29 12:16:41-0800 [Broker,client] environment: {'TMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'COMPUTERNAME': 'TALOS-R3-XP-007', 'MOZ_NO_REMOTE': '1', 'USERDOMAIN': 'TALOS-R3-XP-007', 'TACFILE': '"c:\\talos-slave\\buildbot.tac"', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 23 Stepping 10, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '170a', 'SYSTEMROOT': 'C:\\WINDOWS', 'PATH': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;', 'NO_EM_RESTART': '1', 'BB_BUILDBOT': '"C:\\mozilla-build\\python25\\Scripts\\buildbot" ', 'XPCOM_DEBUG_BREAK': 'warn', 'TACSCRIPT': '"c:\\tools\\buildbot-helpers\\buildbot-tac.py"', 'TEMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld', 'USERNAME': 'cltbld', 'LOGONSERVER': '\\\\TALOS-R3-XP-007', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'CYGWINBASE': 'C:\\cygwin', 'BOOTMODE': 'BKSTD', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH', 'PASSWORD': '"secret"', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld\\Application Data', 'TYPE': '~0,5', 'SYSTEMDRIVE': 'C:', 'HOSTNAME': 'talos-r3-xp-007', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'C:\\talos-slave\\talos-data', 'PROCESSOR_LEVEL': '6', 'BB_PYTHON': '"C:\\mozilla-build\\python25\\Scripts\\..\\python"', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'CONTROLFILE': '"c:\\buildbot-tac.control"', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld'}
2011-03-29 12:16:41-0800 [Broker,client] closing stdin
2011-03-29 12:16:41-0800 [Broker,client] using PTY: False
2011-03-29 12:17:12-0800 [-] Received SIGBREAK, shutting down.
2011-03-29 12:17:12-0800 [-] stopCommand: halting current command <buildbot.slave.commands.base.SlaveShellCommand instance at 0x014351C0>
2011-03-29 12:17:12-0800 [-] command interrupted, killing pid 3500
2011-03-29 12:17:12-0800 [-] trying process.signalProcess('KILL')
Comment 20 • 13 years ago • Assignee
The new _dumbwin32proc.py somehow caused the talos jobs to not kill everything (I did not manage to reproduce this on staging). The difference between unit test jobs and talos jobs is that one pulls the latest tools and the other doesn't. The latest count_and_reboot.py has "shutdown -f -r -t 0" instead of "shutdown -r". The tools checkout has been updated to the latest:
http://build.mozilla.org/talos/tools/buildfarm/maintenance/count_and_reboot.py
This means that the talos jobs can now reboot even if there are prompts.

I have marked slaves talos-r3-xp-0[04-30] (except #5) to pick up the new _dumbwin32proc.py file on reboot (04 to 25 at 11:30am PDT, 26 to 30 at 1:15pm PDT). If everything goes well I will mark the remaining slaves early in the morning. I will keep an eye on the XP slaves until tomorrow morning.
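
The difference between the two reboot commands quoted above can be made concrete with a small sketch. This is not the contents of count_and_reboot.py; reboot_argv and reboot are hypothetical names, and only the two shutdown command lines are taken from this bug.

```python
import subprocess


def reboot_argv(force=True):
    """Build the Windows shutdown command line (hypothetical helper).

    With force=True, -f closes running applications without waiting on
    them and -t 0 sets the delay before the reboot to zero seconds.
    """
    if force:
        return ["shutdown", "-f", "-r", "-t", "0"]
    return ["shutdown", "-r"]


def reboot(force=True, dry_run=False):
    """Reboot the machine, or just return the command when dry_run is set."""
    argv = reboot_argv(force)
    if not dry_run:
        subprocess.check_call(argv)
    return argv
```

Calling reboot(dry_run=True) returns the forced command line without running it, which is a safe way to check what a slave would execute.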
Comment 21 • 13 years ago • Assignee
I marked slave talos-r3-xp-005 regardless of its "down state", as I assumed it would be reimaged at some point.
Comment 22 • 13 years ago • Assignee
All XP slaves have been marked to get the new _dumbwin32proc.py. No known issues, and happily rebooting now :)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering