Closed Bug 600736 Opened 14 years ago Closed 2 years ago

Need to modify twistd patch on all Win32 slaves to properly manage processes

Categories

(Testing :: General, defect)

x86
Windows XP
defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: cmtalbert, Unassigned)

References

()

Details

Attachments

(2 files)

So, we learned from Alice that the twistd instance on all the win32 talos slaves has the _dumbwin32proc.py twistd file patched to include the code attached in the pastebin. This code has the effect of enabling twistd to manage all processes that are spawned in one job object. This means it is easy for Twistd to kill downstream processes. This is good until a downstream application wants to control spawned processes more directly. This is what Mozmill does on Windows. Mozmill creates its own job object so that it can manipulate Firefox at will. This is necessary to run our restart tests properly and ensure that Firefox is completely shut down when we do the "restart" in the test. In order for this to work in buildbot, we need to add the JOB_OBJECT_LIMIT_BREAKAWAY_OK constant to the creationflags of the spawned process in twistd. The change to the pastebin'ed code would merely be a change to the CreateProcess call, i.e: # start the process as suspended self.hProcess, self.hThread, self.dwPid, self.dwTid = win32process.CreateProcess( command, cmdline, None, None, 1, CREATE_SUSPENDED | JOB_OBJECT_LIMIT_BREAKAWAY_OK, env, path, StartupInfo) This will allow any downstream process to break out of job control and to manage their own processes. I think this is going to be necessary anyway for E10S firefox as it will need to do the same thing (I imagine). Jhammel is going to test this twistd change in our tools-staging system as we finish out the rest of the necessary Mozmill support for this change. We will need this to bring bug 516984 live on windows, and as such, I've marked it a major bug. I believe the overall risk of such a change is low since the slaves reboot after each run which will eliminate any rogue processes.
The fix actually needs to be done against the job object. Once created the LimitFlags need to have the value JOB_OBJECT_LIMIT_BREAKAWAY_OK or'd in. I would also suggest or'ing in the value JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE which will kill all of the processes in the job if twistd itself dies. This needs to be done before assigning the new process to the job. Something like: hJob = CreateJobObject(NULL, str(time()); if (hJob != NULL) { JOBOBJECT_EXTENDED_LIMIT_INFORMATION jeli = { 0 }; jeli.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE | JOB_OBJECT_LIMIT_BREAKAWAY_OK; if( 0 == SetInformationJobObject( hJob, JobObjectExtendedLimitInformation, &jeli, sizeof(jeli))) { ::MessageBox( 0, "Could not SetInformationJobObject", "Error", MB_OK); } } This will allow child processes (like Mozmill) to create jobs to manage their own spawned processes on Windows.
We'll want to be sure that this doesn't affect buildbot's ability to kill all active processes on shutdown - which is something that we rely on to keep a clean system.
See Also: → 600321
Component: Release Engineering → General
Product: mozilla.org → Testing
QA Contact: release → general
Version: other → unspecified
Bear, why did this get moved?
When the twisted patch is ready from your point of view and accepted by upstream, reassign the bug to us so we can update our packages and then start testing.
Assignee: nobody → jhammel
No, we're not waiting for Twisted to fix this. There is already a patch applied to twisted on all the slaves. We're going to fix that patch ourselves and ensure it works in the tools-staging environment, then we'll mark this as "needs downtime" and get it included in the production environments. We'll submit the patch to twisted of course, but we're going to use it whether or not they take it.
So I've taken tools-r3-w7-001 to try to test out a fix for this. So far failures are mysterious and I haven't even gotten to actually fixing things yet. For sure, this should be reverted from this slave before anyone else uses it. My WIP is here: http://people.mozilla.org/~jhammel/killprocess/_doCreate.py Currently, this doesn't do anything that isn't in :anode's patch. But of course it still doesn't work. I also patch _dumbwin32proc.py to use the above code. I'm a bit confused on the state of twisted on the slaves. Is there any reason that we patch the twisted code rather than using the vendor branch we already have? http://hg.mozilla.org/build/twisted/ Or am I confused? If we used twisted from a VCS, patching + testing on the slave would be much easier and more expedient
Hi guys, To enable unit tests on the XP slaves I have to ensure that we can kill processes. To do so I will have to match _dumbwin32proc.py to what we have on the win7 and win7x64 machines. The patch is this: https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=499063 Could you please repost http://pastebin.mozilla.org/801026 so I can see what where you referring to? What is the status on this bug?
(In reply to comment #11) > Could you please repost http://pastebin.mozilla.org/801026 so I can see what > where you referring to? > > What is the status on this bug? I haven't worked on this bug in awhile, though it is still an outstanding issue. I'm not sure what that pastebin is or where it is referred to here. The issue in short is that downstream job objects should be able to kill themselves. Currently they do not.
(In reply to comment #11) > Hi guys, > To enable unit tests on the XP slaves I have to ensure that we can kill > processes. > To do so I will have to match _dumbwin32proc.py to what we have on the win7 and > win7x64 machines. > > The patch is this: > https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=499063 > > Could you please repost http://pastebin.mozilla.org/801026 so I can see what > where you referring to? > > What is the status on this bug? I forgot to mention that we never needed this after all. We added wbem into the PATH: https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=501130 and added "-f" to our reboot step: https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=499589
I don't understand. We found that this patch was already in place and so we turned off mozmill's ability to manage it's own job objects. This means mozmill DEPENDS on twistd to manage the job object. If you created new builders without this twistd patch in place, then mozmill should have been smart enough to detect that and start managing its own job objects again. This still does not explain why we see mozmill hangs on other os's though.
To clarify my previous comment I was referring to XP slaves (see comment 11). win7 and win7x64 have the twisted patched. winXP slaves do *not* have the twisted patch.
Depends on: 640047
From the perspective of deploying this on the production slaves (bug 637347), and after some conversation with IRC, here's my understanding: ctalbert will work on a patch that we can include with buildbot-slave, rather than in Twisted. Preferably, that would be upstreamable. At worst, it would be a monkeypatch that we can install via the buildbot-slave package, avoiding the need to install a customized Twisted. Upstreamable would mean: - no regressions on older versions of Windows, even if new functionality isn't available - compatible with multiple versions of Twisted, or at least with graceful fallback - old behavior (not killing subprocesses) is selectable via the useProcGroup parameter or a new parameter I'm going to hold off deploying the newer version of Buildbot to Windows systems until this is ready to roll, at which point we can do the usual bake-in-staging dance before installing it everywhere (via OPSI and VNC).
Since the do_create method is rather a monolith, I doubt this can be sensibly monkey patched, sadly: http://twistedmatrix.com/trac/browser/trunk/twisted/internet/_dumbwin32proc.py#L172 With respect to upstreamable, see comment 3. I looked at this a bit (meaning for quite a lot of time without quite a lot of results) and found the calling chain to be just horrendous. I'm not sure if I have a single good recommendation, but I think monkey-patching will be hard and probably impossible to do in a good way.
Right, I don't think Mozilla should take on the relevant Twisted bug, but I would like to see this functionality corrected in Buildbot. If it can be patched, it can be monkey-patched, just at the cost of duplicating a lot of code and limiting applicability to a more narrow range of Twisted versions. Depending on how much code is duplicated, I'll make the call about merging into Buildbot upstream. Let's get something that works, and see if we can whittle down the code duplication.
The alteration to _dumbwin32proc.py was rolled out to all xp talos boxes are part of bug 420216 - if it has been reverted we should determine when/how that happened. This happened long before win7 or win7_64 existed, so it is a regression in our machine configuration.
Attachment #484797 - Attachment mime type: text/x-c++src → text/plain
Attachment #484800 - Attachment mime type: text/x-c++src → text/plain
Dustin and I cooked up a plan in IRC to address this. We have had no luck getting this patch upstreamed into Twisted. We will have much more luck getting a patch upstreamed into buildbot. So, we are going to try to fix this with a patch to the buildbot slave code itself. The fix will be to essentially add in the code described in comment 1 into the buildbot slave code that launches commands on windows. Even if we were to fail to upstream the patch into buildbot, we at least have a mechanism for rolling out custom changes to buildbot,whereas we do not have such a mechanism for twisted. So, this is simply all together better. I'm hoping to be working on this after the all hands next week.
Sounds fantastic :)
(In reply to comment #0) > So, we learned from Alice that the twistd instance on all the win32 talos > slaves has the _dumbwin32proc.py twistd file patched to include the code > attached in the pastebin. So I learned recently, that mozilla pastebin doesn't "Save Forever" even when you select that. This was set to only a month in all cases. This due to the fact that some spammers had taken to pastebin "forever" saving :/ So any chance you can resurrect what you had here and/or confirm that http://mxr.mozilla.org/build/source/opsi-package-sources/twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py is actually the new file?
(In reply to comment #22) > > So any chance you can resurrect what you had here and/or confirm that > http://mxr.mozilla.org/build/source/opsi-package-sources/twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py > is actually the new file? Callek that is the file that got deployed recently to XP slaves. Here is the diff to what there was on the slaves VS what there is now: https://bug614955.bugzilla.mozilla.org/attachment.cgi?id=499149 I can't say what the pastebin was but I believe from comment 0 that the only changes that ctalbert wanted were to change this line: http://mxr.mozilla.org/build/source/opsi-package-sources/twisted_dumbwin32proc/CLIENT_DATA/_dumbwin32proc.py#173
Clint, Jeff, any progress here? I'd like to deploy the new buildslave to the windows machines shortly, so I need something that won't break test executions.
So to recap, the current twistd patch does not allow breakaway jobs on windows. This results in harnesses (or other software) potentially error-ing if they try to create a job object. However, it does (loosely) guarantee that twisted can clean up all jobs -- if breakaways are allowed, the process that spawns breakaways must clean up after them. Presumably, the only reason the subprocess would spawn a job object is if it wanted/needed to clean up after it itself. This is the case for mozmill, which AFAIK is the only case where this is currently needed in the wild, although if the mozprocess module (https://github.com/mozautomation/mozmill/tree/master/mozprocess) is used by other harnesses for more robust process handling, then mozmill will no longer be a singularity here. If mozprocess cannot create a breakaway job, it punts and "assumes" that it is in a job object that will be correctly cleaned up. Its been awhile since I visited this issue, so my memory may be hazy. The last I recall, there was no known problem with mozmill or twisted failing cleanup. This could actually be wrong, as there was some issue that I only saw proverbially out of the corner of my eye but I never quite understood and mozmill was shut off I think before anyone really understood what was going on. But I'm a little fuzzy on this. Does anyone remember any better? Assuming this is not a problem, currently, my current recommendation is to keep this bug open but not to particularly prioritize it until either it becomes necessary or someone has time to work on it. :ctalbert, however, would have a better recommendation and he will be back in town next week. Currently, while the patch is not hard to write, it is fairly difficult and time-consuming to test. The autotools staging environment is not currently working, so that makes it that much the worse. :dustin, if you wanted to take over this bug and have an easier time testing than us, you'd be more than welcome. Otherwise, I'm not sure if this is a priority now, though IMHO it is something that should be done. :ctalbert may have more details on his return, unsure
Well, it's a priority because I'll be deploying a new version of Buildbot, replete with its own virtualenv and a new version of Twisted, within the next week or two. I suppose my best course is to try to re-apply the existing patch to Twisted-11.0.0. I don't have a great way to test this, but I can put it in dev/pp for a while and run some mozmill tests against it.
Here's how to test to see if the patch is working correctly: 1) install patch 2) run talos tp test through buildbot 3) go to slave that is actively running tp in a browser window, stop buildbot slave 4) if the running browser window is killed, then it is working correctly. if the browser window stays up, then buildbot did not kill all the spawned talos processes.
Oh, and you'll want to do that test on a windows slave - that's where the job object management happens.
From IRC: - mozmill is currently off - mozmill is probably better at handling jobs than _dumbwin32proc.py - if this is still broken, we should make a patch for buildbot Based on that I'm going to deploy a "stock" Twisted-11.0.0 for now. So this bug is no longer a priority for me.
The issue being that talos does not manage jobs. We require the _dumbwin32proc.py patch so that we don't get talos slaves stuck with open browser windows when something bad happens. I don't believe that a stock twisted deployment will handle the job objects for us, and will results in stuck test slaves.
I believe alice is correct. I don't think stock twisted will correctly work, so *some* patch needs to be applied. This is somewhat an aside from the bug's subject matter
No longer blocks: 637347
I'm not working on this ATM, so resetting assignee
Assignee: jhammel → nobody
Severity: major → S2
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: