Closed Bug 645153 Opened 13 years ago Closed 13 years ago

XP debug jobs are sometimes hanging

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(2 files, 3 obsolete files)

I have seen a lot of xpcshell jobs hang but there are another 3-4 test suites that I have seen fail.

Most of them don't show the "send error report" dialog.
Most of them just stop running the tests without giving any reason. No message. Nothing.

Then buildbot tries to kill it but it can't because it doesn't have the twisted patch, and we end up with the ProcessExitedAlready issue (see bug 626486).
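
For readers unfamiliar with the failure mode, here is a minimal sketch of the behaviour we expect from the harness: kill a command that has produced no output for a while, and tolerate the case where the process has already gone away. This is plain standard-library Python, not buildbot or Twisted code; the function name run_with_output_timeout and its parameters are assumptions made up for illustration.

```python
import subprocess
import sys
import threading
import time


def run_with_output_timeout(argv, idle_timeout):
    """Run argv and kill it if it prints nothing for idle_timeout seconds.

    Hypothetical helper for illustration only.
    Returns (returncode, timed_out, lines).
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.monotonic()]  # updated by the reader thread
    lines = []

    def reader():
        # Record each output line and the time it arrived.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > idle_timeout:
            timed_out = True
            try:
                proc.kill()
            except OSError:
                # The process exited between poll() and kill();
                # there is nothing left to kill, so carry on.
                pass
            break
        time.sleep(0.05)

    proc.wait()
    thread.join(timeout=5)
    return proc.returncode, timed_out, lines
```

The point of the try/except is that "already exited" is treated as success rather than as an error that leaves the job in limbo.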

I can reproduce the problem manually by running this (on a slave that I know was having the issue):
bash -c 'if [ ! -d firefox/plugins ]; then mkdir firefox/plugins; fi && cp bin/xpcshell.exe firefox && cp -R bin/components/* firefox/components/ && cp -R bin/plugins/* firefox/plugins/ && python -u xpcshell/runxpcshelltests.py --symbols-path=symbols --manifest=xpcshell/tests/all-test-dirs.list firefox/xpcshell.exe'

For instance it stopped at this test:
> TEST-INFO | c:\talos-slave\test\build\xpcshell\tests\netwerk\test\unit\test_dns_service.js | running test ...

I will be looking at deploying the twisted patch to see whether it fixes the issue (attachment 499149 [details] [diff] [review]).

Where can I look for a log or something?

For now the way to get it out of that state is to reboot the slave with "shutdown -f -r -t 0"
An example of debug jsreftest (no pop up):
++DOMWINDOW == 5404 (342D4B48) [serial = 5405] [outer = 06A4E0C8]
REFTEST TEST-START | file:///C:/talos-slave/test/build/jsreftest/tests/jsreftest.html?test=js1_8/regress/regress-464418.js
++DOMWINDOW == 5405 (34242928) [serial = 5406] [outer = 06A4E0C8]
BUGNUMBER: 464418
STATUS: Do not assert: fp->slots + fp->script->nfixed + js_ReconstructStackDepth(cx, fp->script, fp->regs->pc) == fp->regs->sp
TEST-UNEXPECTED-FAIL | file:///C:/talos-slave/test/build/jsreftest/tests/jsreftest.html?test=js1_8/regress/regress-464418.js | application timed out after 330 seconds with no output

command timed out: 1200 seconds without output
debug reftest is running extremely slowly on talos-r3-xp-053 (I am rebooting it).
All other debug reftest jobs seem to take less than 50 minutes to complete, while this job had already been running for over 2 hours and had completed only 1300 of the >5000 tests to be run.
talos-r3-xp-029 with a debug xpcshell job, a "send report" prompt and ProcessExitedAlready.

talos-r3-xp-022 with an *opt* xpcshell job. No prompt. ProcessExitedAlready

talos-r3-xp-012 with a debug xpcshell job. No prompt. ProcessExitedAlready
Could someone tell me what this test is doing?
http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/test_dns_service.js

How can I run that test individually?
Assignee: nobody → armenzg
I have added on staging two xp slaves (xp-002 and xp-003).
The second one has the twisted patches:
wget -Osubproc.patch --no-check-certificate https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=499149
cd C:\mozilla-build\python25\Lib\site-packages
C:\mozilla-build\msys\bin\patch -p1 < "C:\Documents and Settings\cltbld\subproc.patch"

I will review how things are going after the weekend.
I will meanwhile be looking several times a day for stranded XP slaves to reboot.

If things get very bad all we have to do is to disable XP debug unit tests:
http://hg.mozilla.org/build/buildbot-configs/file/tip/mozilla-tests/config.py#l264
by changing it to an empty list.
(In reply to comment #4)
> Could someone tell me what this test is doing?
> http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/test_dns_service.js
> 
> How can I run that test individually?

SOLO_FILE=test_dns_service.js make -C ff-objdir/netwerk/test/ check-one
The patch deploys the new file into C:\mozilla-build\python25\Lib\site-packages\twisted\internet\_dumbwin32proc.py.

This package also allows us to revert to the original file.
Attachment #522425 - Flags: review?(dustin)
Comment on attachment 522425 [details] [diff] [review]
[opsi] replace _dumbwin32proc.py with version that allows buildbot to kill jobs

Looks good.  My only concern is ordering: if this gets "installed" before the Twisted package does, its changes will be overwritten.  Is there a way to control that with OPSI?
Attachment #522425 - Flags: review?(dustin) → review+
We don't deploy Twisted through OPSI on XP slaves so we don't have to worry about it :)
I thought the first patch worked but it was creating a _dumbwin32proc.py directory instead of copying the file.

I have modified the patch, adding double quotes and "new" and "original" directories, but it is not working on staging.

I guess I will have to keep on rebooting slaves until tomorrow.

...
scriptname: "twisted_dumbwin32proc.ins", special path: "P:\install\twisted_dumbwin32proc\"
...

============ Version 4.8.8.1 WIN32 script "P:\install\twisted_dumbwin32proc\twisted_dumbwin32proc.ins"
             start: 2011-03-28  13:23:29  (on client named as : "talos-r3-xp-003.build.mozilla.org")
[executing: "C:\Program Files\opsi.org\preloginloader\opsi-winst\winst32.exe"]
system infos:
D4:9A:20:BC:E1:E0  -  PC hardware address
talos-r3-xp-003.build.mozilla.org  -  IP name 
10.12.50.111  -  IP address
ENU  -  System default locale 


Execution of Files_twisted
  Error:  Directory P:\install\twisted_dumbwin32proc\new\_dumbwin32proc.py does not exist and cannot be created
___________________
1 error
0 warnings


no script found for file name ""
nthomas, rail, bhearsum: do you see anything that I am doing obviously wrong in the patch?
Nothing jumps out at me. Maybe doublecheck the syntax for winst's copy utility ? And you'll have to be super careful about the slave state, as well as building and registering the opsi package.
Finally!
I was using the wrong winst syntax.
DosInAnIcon is the way to go.
Files treats its arguments as directories and is generally used with directory\*.*

This worked to install the package and to uninstall it.

I will deploy this after lunch.
Attachment #522425 - Attachment is obsolete: true
Attachment #522483 - Attachment is obsolete: true
Attachment #522704 - Flags: review?(dustin)
Comment on attachment 522704 [details] [diff] [review]
[opsi] replace _dumbwin32proc.py with version that allows buildbot to kill jobs

I'm still worried about the ordering here.  We could end up with some systems actually having this patch and others not having it.

Also, the install should probably make a backup copy of what's already installed.  The strategy of uninstalling by installing a different file may lead to confusion and unhappiness.  You could also just omit the uninstall step entirely.
Attachment #522704 - Flags: review?(dustin) → review-
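
The backup-on-install strategy requested in the review can be sketched as follows. This is not the OPSI/winst script from the attachment; it is a Python illustration of the same idea, and the function names and the ".orig" suffix are assumptions chosen for the example. It also guards against the earlier failure where the target ended up as a directory instead of a file.

```python
import os
import shutil

BACKUP_SUFFIX = ".orig"  # hypothetical naming, not taken from the OPSI package


def install_patched_file(new_file, target):
    """Replace target with new_file, keeping a backup of the original."""
    if not os.path.isfile(new_file):
        raise ValueError("source must be a regular file: %s" % new_file)
    if os.path.isdir(target):
        raise ValueError("target is a directory, expected a file: %s" % target)
    backup = target + BACKUP_SUFFIX
    # Only back up once, so reinstalling never overwrites the true original.
    if os.path.isfile(target) and not os.path.exists(backup):
        shutil.copy2(target, backup)
    shutil.copy2(new_file, target)


def uninstall_patched_file(target):
    """Restore the backed-up original; return False if there is no backup."""
    backup = target + BACKUP_SUFFIX
    if not os.path.isfile(backup):
        return False
    os.replace(backup, target)  # overwrites target on Windows and POSIX
    return True
```

Uninstalling by restoring whatever was there before avoids shipping a second "original" copy that may not match what a given slave actually had.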
This tackles what you wanted and has been tested on staging for install and uninstall scenarios.

As per IRC, ordering does not matter, as Twisted comes from the ref machine and not from OPSI.
Attachment #522704 - Attachment is obsolete: true
Attachment #522761 - Flags: review?(dustin)
Attachment #522761 - Flags: review?(dustin) → review+
Comment on attachment 522761 [details] [diff] [review]
[opsi] replace _dumbwin32proc.py with version that allows buildbot to kill jobs (take 2)

This got landed with this:
http://hg.mozilla.org/build/opsi-package-sources/rev/520de951bbb0

All XP slaves have been marked for this to get deployed.

I hope this makes the cut to fix this issue.

The maintenance page and the Reference page have been updated:
https://wiki.mozilla.org/ReferencePlatforms/Test/WinXP#Twisted_patch_to_allow_buildbot_to_kill_jobs

There is nothing to be done with the ref machine AFAIU.

It was believed that this patch already existed on the XP slaves since it was part of the rev2 machines.
The change could have been lost in a mozilla-build re-install or the python update (from 2.4 to 2.5). We don't really know.

For historical purposes, the relevant bugs are bug 420216 and bug 537751.
Attachment #522761 - Flags: checked-in+
I have reverted this change on the XP slaves as the reboot step for talos jobs gets affected.

We had to RDP into each slave whose talos job was stuck on the reboot step, since a reboot had already been initiated but was stuck on a prompt.

Everything is back to normal. I will continue to reboot the usual XP debug jobs until tomorrow.

On unit tests:
http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#453
VS direct call of count_and_reboot.py
http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#7600

Talos output:
2011-03-29 12:16:41-0800 [Broker,client]   in dir C:\talos-slave\test\../talos-data (timeout 1200 secs)
2011-03-29 12:16:41-0800 [Broker,client]   watching logfiles {}
2011-03-29 12:16:41-0800 [Broker,client]   argv: ['python', 'count_and_reboot.py', '-f', '../talos_count.txt', '-n', '1', '-z']
2011-03-29 12:16:41-0800 [Broker,client]  environment: {'TMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'COMPUTERNAME': 'TALOS-R3-XP-007', 'MOZ_NO_REMOTE': '1', 'USERDOMAIN': 'TALOS-R3-XP-007', 'TACFILE': '"c:\\talos-slave\\buildbot.tac"', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 23 Stepping 10, GenuineIntel', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_REVISION': '170a', 'SYSTEMROOT': 'C:\\WINDOWS', 'PATH': 'C:\\Python24;C:\\Python24\\Scripts;C:\\cygwin\\bin;C:\\WINDOWS\\System32;C:\\program files\\gnuwin32\\bin;C:\\WINDOWS;', 'NO_EM_RESTART': '1', 'BB_BUILDBOT': '"C:\\mozilla-build\\python25\\Scripts\\buildbot" ', 'XPCOM_DEBUG_BREAK': 'warn', 'TACSCRIPT': '"c:\\tools\\buildbot-helpers\\buildbot-tac.py"', 'TEMP': 'C:\\DOCUME~1\\cltbld\\LOCALS~1\\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'ALLUSERSPROFILE': 'C:\\Documents and Settings\\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\\Documents and Settings\\cltbld', 'USERNAME': 'cltbld', 'LOGONSERVER': '\\\\TALOS-R3-XP-007', 'PROMPT': '$P$G', 'COMSPEC': 'C:\\WINDOWS\\system32\\cmd.exe', 'CYGWINBASE': 'C:\\cygwin', 'BOOTMODE': 'BKSTD', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH', 'PASSWORD': '"secret"', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'WINDIR': 'C:\\WINDOWS', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\\Documents and Settings\\cltbld\\Application Data', 'TYPE': '~0,5', 'SYSTEMDRIVE': 'C:', 'HOSTNAME': 'talos-r3-xp-007', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'C:\\talos-slave\\talos-data', 'PROCESSOR_LEVEL': '6', 'BB_PYTHON': '"C:\\mozilla-build\\python25\\Scripts\\..\\python"', 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'CONTROLFILE': '"c:\\buildbot-tac.control"', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\\Documents and Settings\\cltbld'}
2011-03-29 12:16:41-0800 [Broker,client]   closing stdin
2011-03-29 12:16:41-0800 [Broker,client]   using PTY: False
2011-03-29 12:17:12-0800 [-] Received SIGBREAK, shutting down.
2011-03-29 12:17:12-0800 [-] stopCommand: halting current command <buildbot.slave.commands.base.SlaveShellCommand instance at 0x014351C0>
2011-03-29 12:17:12-0800 [-] command interrupted, killing pid 3500
2011-03-29 12:17:12-0800 [-] trying process.signalProcess('KILL')
With the new _dumbwin32proc.py the talos jobs somehow did not kill everything (I did not manage to reproduce this on staging).

The difference between unit test jobs and talos jobs is that one pulls the latest tools and the other one doesn't.
The latest count_and_reboot.py has "shutdown -f -r -t 0" instead of "shutdown -r".

The tools checkout has been updated to the latest:
http://build.mozilla.org/talos/tools/buildfarm/maintenance/count_and_reboot.py

This means that the talos jobs can now reboot even if there are prompts open.
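
To make the difference between the two reboot commands explicit, here is a small sketch. It is not the code in count_and_reboot.py; build_reboot_command is a hypothetical helper that only assembles the argument list, using the shutdown flags quoted in this bug (-f to force running applications to close, -r to restart, -t for the delay in seconds).

```python
def build_reboot_command(force, delay_seconds=None):
    """Assemble a Windows 'shutdown' argv (illustrative helper only)."""
    argv = ["shutdown"]
    if force:
        # -f closes running applications without waiting on prompts.
        argv.append("-f")
    # -r restarts the machine instead of just shutting it down.
    argv.append("-r")
    if delay_seconds is not None:
        # -t sets how many seconds to wait before the reboot starts.
        argv += ["-t", str(delay_seconds)]
    return argv


old_command = build_reboot_command(force=False)
new_command = build_reboot_command(force=True, delay_seconds=0)
```

Here old_command corresponds to "shutdown -r" and new_command to "shutdown -f -r -t 0"; only the latter gets past an open dialog.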

I have marked slaves talos-r3-xp-0[04-30] (except #5) to pick up the new _dumbwin32proc.py file on reboot (04 to 25 at 11:30am PDT, 26 to 30 at 1:15pm PDT).

If everything goes well I will mark the remaining slaves early in the morning.

I will keep an eye on the XP slaves until tomorrow morning.
I marked slave talos-r3-xp-005 regardless of its "down state", as I had assumed it would be reimaged at some point.
All XP slaves have been marked to get the new _dumbwin32proc.py

No known issues and happily rebooting now :)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering