Closed Bug 639879 Opened 13 years ago Closed 13 years ago

Hang during or after test-api-utils.testAddRemoveBackingArray when run on win7 under buildbot

Categories

(Add-on SDK Graveyard :: General, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cmtalbert, Assigned: cmtalbert)

References

Details

Attachments

(2 files)

When running on a Windows 7 slave, the test test-collection.testAddRemoveBackingArray hangs, causing the 300-second test timeout exception to fire.

This exception is caught by the underlying framework, which correctly attempts to kill the browser.  The underlying framework issues a TerminateJobObject call to Windows; Windows reports that the operation succeeded, but Firefox remains in the process list.

You can see this in the attached build log, which contains an enormous amount of extra logging.  The important parts are below (lines starting with # are my commentary):

.info: ++++ Test Pass: test-collection.testAddRemoveBackingArray
# This is the test we're currently running ^

++++ runner.py: Attempting runner.stop, caught exception
# Caught the exception ^

++++ mozrunner_init: stop called, calling kill
# going to kill the process due to the exception ^

++++ mozrunner_init: Win32 kill code called
++++ killableprocess:kill has been called group is True
++++ killableprocess:kill attempting TerminateJobObject
++++ winprocess.py: ErrCheckBool func is <WinFunctionType object at 0x01E24918> with result 1
# ^ TerminateJobObject returns a nonzero value, which indicates success.

++++ killableprocess: setting return code to 127
++++ runner.py: Attempting to remove profile now
++++ runner.py: Calling profile.cleanup()
++++ __init__:cleanup profile called
++++ __init__: attempting rmtree
++++ __init__: Attempting to display all processes
# ^ I had our profile cleanup code enumerate and display currently running processes on the slave
Dwm.exe
Explorer.EXE
taskhost.exe
taskeng.exe
cmd.exe
conhost.exe
RtHDVCpl.exe
Bootcamp.exe
ApacheMonitor.exe
python.exe
cmd.exe
bash.exe
bash.exe
python.exe
firefox.exe
# ^ Look at that @#$%^! still running!

++++ __init__: Ending display of processes
+++ __init__: oncleanup_error: [Error 13] The process cannot access the file because it is being used by another process: 'c:\\users\\cltbld\\appdata\\local\\temp\\tmpydp1zr.mozrunner\\Cache\\_CACHE_001_'
# ^ And now we get those "unable to remove profile" errors because firefox is still running, tying up the files in the profile directory.

I think that the test is causing some kind of unusual hang in Firefox and is breaking our underlying framework.  I'll keep looking to see if there is anything else I can do in the Windows API to kill off an unresponsive app, but at the moment, TerminateJobObject is pretty damn drastic.  It says "cease and desist now" to the controlling process and any subprocesses it spawned.  The underlying processes are not supposed to be able to defer, postpone, or weasel out of termination in any way.

See: http://msdn.microsoft.com/en-us/library/ms686709%28VS.85%29.aspx
and: http://msdn.microsoft.com/en-us/library/ms686714%28v=VS.85%29.aspx
There is yet another page about TerminateProcess:
http://msdn.microsoft.com/en-us/library/aa450927.aspx
With some more details:
> Terminating a process does not necessarily remove the process object from the system.
> A process object is deleted when the last handle to the process is closed. 

So any open I/O (logging, sqlite, whatever) may lead to a situation where Firefox is still busy closing those handles, and it takes time to do so.

If you look at this source code:
http://src.chromium.org/svn/trunk/src/base/process_util_win.cc
the "KillProcess" function seems to wait for the process to really end.
Right, I read that this morning as well, and had Lukas push another change to buildbot for me.  The change I made was that when mozrunner does a kill, it also does a wait afterward to ensure that the process has terminated before returning control to the caller.

By doing the wait after the kill, we ensure that the process has ended, and the profile gets cleaned up properly.  So that is good.  But the test is still causing an exception, which should still be addressed (especially since this test only throws the exception when run in the buildbot framework).
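
For illustration, here is a minimal sketch of that kill-then-wait pattern on Windows, assuming we already hold a job-object handle and a process handle (the kill_and_wait helper and its arguments are hypothetical, not the actual mozrunner code):

import ctypes

kernel32 = ctypes.windll.kernel32
WAIT_TIMEOUT = 0x00000102

def kill_and_wait(job_handle, process_handle, timeout_ms=10000):
    # TerminateJobObject tells Windows to end the process and everything
    # it spawned.  A nonzero return only means the request was accepted.
    if not kernel32.TerminateJobObject(job_handle, 127):
        raise ctypes.WinError()
    # The process object hangs around until it finishes flushing I/O and
    # its handles close, so block until it really exits (or we time out).
    if kernel32.WaitForSingleObject(process_handle, timeout_ms) == WAIT_TIMEOUT:
        raise RuntimeError("process did not exit within %d ms" % timeout_ms)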
Blocks: 640047
This is the fix to the harness to deal with exploding tests properly.

It doesn't fix the exploding test at all.  This patches the mozrunner code directly inside of jetpack; I'll also attach a patch to push these changes into upstream mozrunner.  I've filed bug 640047 to look into using IOCompletion ports within mozrunner, but that is longer-horizon work and will be done as a follow-up to this.  This first patch solves the immediate blocking need so that jetpack can go live on buildbot.

Lukas, when you get a chance, please sendchange a new jetpack test run with the win7 slave AND with a winxp slave.  I have this patch uploaded to http://people.mozilla.org/~ctalbert/jetpack/tip.tar.bz2
Attachment #517946 - Flags: review?(jhammel)
Attachment #517946 - Flags: feedback?(myk)
I wrote this Jetpack test.  So if it were failing or hanging I'd want to fix it, but it's not clear to me from that log that it is.  The Cuddlefish runner is timing out for some reason, and it's happening after or during test-collection.testAddRemoveBackingArray, but why?  And what's with all the "The process cannot access the file because it is being used by another process" errors before the timeout?

Also, why does it look like the suite is being run twice, concurrently?  Is that related to the timeout?  The annotator test is the first to run, and then a little later it's run again.  Same for reading-data (grep for "test-main.testID").  Same for test-packaging.testPackaging and all the other tests.  And there are strange lines in the log like these:

.info: ++++ Test Pass: teinfo: ++++ Test Pass: test-api-utils.testValidateOptionsIs

.st-api-utils.testValidateOptionsIs

.info: ++++ Test Pass: test-app-strings.info: ++++ Test Pass: test-app-strings.testStringBundle
One other thing.  When I run the suite locally, test-commonjs-test-adapter.js runs after test-api-utils.js, so I assume that's the case with the buildbot too.  test-commonjs-test-adapter.js (via api-utils/lib/test.js) potentially puts the test runner into async mode.  test-api-utils.js does not, so it's not really possible for it to cause a timeout.

Irakli, do you know if test-commonjs-test-adapter.js could be causing this hang?
Summary: testAddRemoveBackingArray throws exception when run on win7 under buildbot → Hang during or after test-api-utils.testAddRemoveBackingArray when run on win7 under buildbot
I don't really understand why the addon-sdk has a forked copy of mozrunner.  What's different from the stock copy?
Comment on attachment 517946 [details] [diff] [review]
Fix for addon-sdk

If you're moving to shutil.rmtree, you should uniformly deprecate the other rmtree.

+                # On windows, it sometimes behooves one to wait for dust to settle
+                # after killing processes.  Let's try that.
+                # TODO: Bug 640047 is investigating the correct way to handle this case
+                self.process_handler.wait(timeout=10)

This should probably go in process_handler itself, I think, long term.  This version of mozrunner is so messy, however, that whether it's worth cleaning up and doing it correctly kinda depends on the long-term plans.  If this fork is going to be maintained, it should probably be done correctly.  However, I would encourage pushing the important fixes upstream and then getting the newer, cleaner mozrunner.

+                # Uncomment this line below to discover very useful things about your environment
+                #print "++++ killableprocess: releng twistd patch not applied, we can create job objects"

Should....this go in?

+                while notdone == 1:

<nit> I usually do the do-while pattern in python something like
while 1:
  notdone = check()
  if notdone != 1:
    break
  sleep(0.5)
</nit>
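
For reference, here's that shape as a self-contained snippet; check() is just a hypothetical stand-in for whatever the patch actually polls, not a real mozrunner helper:

import time

def poll_until_done(check, interval=0.5):
    # do-while-ish: run check() at least once, stop as soon as it no
    # longer returns 1, otherwise sleep briefly and poll again.
    while 1:
        notdone = check()
        if notdone != 1:
            break
        time.sleep(interval)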

+                            c_buffer, c_ulong, byref

It doesn't look like you use these new imports?

I'm not sure I'm qualified to know the nuances of Windows job objects, but it looks good to me and I'll trust your judgement.  r=me, though you should clean up the rmtree thing, and maybe we should figure out a better strategy for maintaining this fork in a separate bug
Attachment #517946 - Flags: review?(jhammel) → review+
(In reply to comment #4)
> I wrote this Jetpack test.  So if it were failing or hanging I'd want to fix
> it, but it's not clear to me from that log that it is.  The Cuddlefish runner
> is timing out for some reason, and it's happening after or during
> test-collection.testAddRemoveBackingArray, but why?  And what's with all the
> "The process cannot access the file because it is being used by another
> process" errors before the timeout?
>
The "used by another process errors" are caused by a framework failure which is what my patch addresses.  The reason there is a framework error here is because the test - either this addRemoveBackingArray test is exploding or the test after it is exploding causing the framework to attempt to do a runner.stop().  That runner.stop() code fails to stop the launched application (in this case Firefox) and when we attempt to remove teh profile, we get "file in use" errors.  My patch addresses the issues with stopping the application properly so that this won't happen anymore.
 
> Also, why does it look like the suite is being run twice, concurrently?  
I don't know what's up with that.  The suite is being run by doing a cfx testall -a firefox -b <path to firefox>  from this bash file: http://mxr.mozilla.org/build/source/tools/buildfarm/utils/run_jetpack.sh
> Is that related to the timeout?  The annotator test is the first to run, and then
> a little later it's run again.  Same for reading-data (grep for
> "test-main.testID").  Same for test-packaging.testPackaging and all the other
> tests.  And there are strange lines in the log like these:
> 
> .info: ++++ Test Pass: teinfo: ++++ Test Pass:
> test-api-utils.testValidateOptionsIs
This is from my instrumenting of the addon-sdk code so I could try to figure out what was going on.  Since I had to run this via buildbot to see a failure, the only quick way I knew to see which test was running was to instrument the pass/fail/skip reporting.  So those are my 'print' statements (all the print statements you see in the attached log that start with '++++' are mine).

You do have a good point that I may have the wrong test.  If the test following addRemoveBackingArray blocks before it does any pass/fail/skip reporting, then my instrumentation won't capture that.  So, I'd turn an equal amount of skepticism on the test that comes after this one, especially if it puts things in async mode.  

Thanks for taking a look, sorry for the confusion.
(In reply to comment #7)
> Comment on attachment 517946 [details] [diff] [review]
> Fix for addon-sdk
> 
> If you're moving to shutil.rmtree, you should uniformly deprecate the other
> rmtree.
Maybe.  Since that wasn't the problem, I'm actually more inclined to go back to using the original rmtree implementation.  I really don't want to change more code than I have to here.
> 
> +                # On windows, it sometimes behooves one to wait for dust to
> settle
> +                # after killing processes.  Let's try that.
> +                # TODO: Bug 640047 is investigating the correct way to handle
> this case
> +                self.process_handler.wait(timeout=10)
> 
> This should probably go in process_handler itself, i think, long term.  This
> version of mozrunner is so messy, however, whether its worth cleaning up and
> doing it correctly kinda depends on the long term plans.  If this fork is going
> to be maintained, it should probably be done correctly.  However, I would
> encourage pushing the important fixes upstream and then getting the newer,
> cleaner mozrunner.
All of this is going to be upstreamed.  It just takes a while to do a good patch because so much has been refactored in this code already.  

RE putting it in killableprocess.py -- I can't do that.  killableprocess.py:wait calls kill, so if killableprocess.py:kill calls wait, then we get ourselves into a loop that I'd rather not deal with.  I'd prefer this code to have a clear line of entry and exit, as this process-killing stuff is absolutely critical to the framework.
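
Roughly, the layering I'm after looks like this (a simplified sketch only; the Runner and process_handler names here are stand-ins for the real mozrunner classes):

class Runner(object):
    # Simplified sketch, not the actual mozrunner Runner.
    def __init__(self, process_handler):
        self.process_handler = process_handler

    def stop(self):
        # killableprocess's wait() can call kill() when it times out, so
        # kill() itself must never call wait().  The extra post-kill wait
        # lives up here instead, keeping one clear entry and exit path.
        self.process_handler.kill()
        self.process_handler.wait(timeout=10)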
> 
> +                # Uncomment this line below to discover very useful things
> about your environment
> +                #print "++++ killableprocess: releng twistd patch not applied,
> we can create job objects"
> 
> Should....this go in?
You mean the commented out line? Or do you mean that we should always print out that diagnostic bit of information?  I don't really have an opinion.  It might be good to format this message properly and output that diagnostic bit of info, particularly since it's something of a crap shoot w.r.t. which slaves have the evil twisted patch and which don't.
> 
> +                while notdone == 1:
> 
> <nit> I usually do the do-while pattern in python something like
> while 1:
>   notdone = check()
>   if notdone != 1:
>     break
>   sleep(0.5)
> </nit>
> 
that looks cleaner, I'll do that.

> +                            c_buffer, c_ulong, byref
> 
> It doesn't look like you use these new imports?
Ah yeah, I added them for all the debugging stuff that I wrote to enumerate the currently running processes.  I'll remove them now.
> 
> I'm not sure I'm qualified to know the nuances of windows job objects, but it
> looks good to me and I'll trust your judgement.  r=me, though you should
> cleanup the rmtree thing and maybe we should figure out in a separate bug a
> better strategy to maintaining this fork

I think the maintenance strategy is that we should do the following:
1. Check this in to jetpack to fix their issues (pending f+/a+ from myk)
2. Create a patch against the mozrunner 1.5.x codebase, land it so that we have it there in case we do another 1.5.x release (I hope we don't)
3. Create a patch against the mozrunner 2.0 codebase and land there so that we can upgrade all this code once 2.0 is ready.

Ideally the upgrade path for this fork is to 2.0.  But if a 1.5.3 is released, then that would be the next logical upgrade so this patch needs to be on both branches.
(In reply to comment #8)
> > .info: ++++ Test Pass: teinfo: ++++ Test Pass:
> > test-api-utils.testValidateOptionsIs
> This is from my instrumenting of the addon-sdk code so I could try to figure
> out what was going on.

I mean that this output is getting interleaved with itself, as if the suite is running twice concurrently, and the two runs are dumping output at the same time.  e.g., "teinfo" in the line quoted above.

> You do have a good point that I may have the wrong test.  If the test following
> addRemoveBackingArray blocks before it does any pass/fail/skip reporting, then
> my instrumentation won't capture that.

Is there a reason you can't dump the cfx output directly?  When you run it with -v, it prints lots of info like passes, failures, and the names of tests it's about to execute.  It would be helpful if the buildbot logs or whatever (I don't know how all this is set up) included all the cfx -v output.
(In reply to comment #9)
> (In reply to comment #7)
> > Comment on attachment 517946 [details] [diff] [review]
> > If you're moving to shutil.rmtree, you should uniformly deprecate the other
> > rmtree.
> Maybe, since that wasn't the problem, I'm actually more inclined to go back to
> using the original rmtree implementation.  I really don't want to change more
> code than I have to here.

Either way is fine, so long as we don't have two implementations running around.

> > +                # Uncomment this line below to discover very useful things
> > about your environment
> > +                #print "++++ killableprocess: releng twistd patch not applied,
> > we can create job objects"
> > 
> > Should....this go in?
> You mean the commented out line? Or do you mean that we should always print out
> that diagnostic bit of information?  I don't really have an opinion.  It might
> be good to format this message properly and output that diagnostic bit of info,
> particularly since it's something of a crap shoot w.r.t. which slaves have the
> evil twisted patch and which don't.

I guess, for the clean upstream implementation, I would have a method in e.g. the runner that determines this information and have it callable and logged/printed/reported/etc. in the mozmill package, since we do all of our gathering there.


> > </nit>
> > 
> that looks cleaner, I'll do that.

I miss do-while in Python, but this seems about the closest.
 
> I think the maintenance strategy is that we should do the following:
> 1. Check this in to jetpack to fix their issues (pending f+/a+ from myk)
> 2. Create a patch against the mozrunner 1.5.x codebase, land it so that we have
> it there in case we do another 1.5.x release (I hope we don't)
> 3. Create a patch against the mozrunner 2.0 codebase and land there so that we
> can upgrade all this code once 2.0 is ready.
> 
> Ideally the upgrade path for this fork is to 2.0.  But if a 1.5.3 is released,
> then that would be the next logical upgrade so this patch needs to be on both
> branches.

Sounds like a good plan
Comment on attachment 517946 [details] [diff] [review]
Fix for addon-sdk

This seems fine to me!


> I mean that this output is getting interleaved with itself, as if the suite is
> running twice concurrently, and the two runs are dumping output at the same
> time.  e.g., "teinfo" in the line quoted above.

I wonder if the e10s stuff has anything to do with this.  I think the test harness runs tests twice, once in "e10s mode".  Perhaps the second run is starting before the first run finishes?
Attachment #517946 - Flags: feedback?(myk) → feedback+
Clint: As an aside, I'm not familiar with the current status and plans for mozrunner, so perhaps we could get together with new Jetpack PM Dave Mason this week to get up to speed and learn how we can best maintain this dependency?
Keywords: checkin-needed
(In reply to comment #13)
> Clint: As an aside, I'm not familiar with the current status and plans for
> mozrunner, so perhaps we could get together with new Jetpack PM Dave Mason this
> week to get up to speed and learn how we can best maintain this dependency?

Yes, let's do that.  For landing this in jetpack, do I check into the git repo or the hg repo?  Do I even have permissions to check in or do I need to get someone to land this for me?
(In reply to comment #14)
> Yes, let's do that.

Ok, I'll schedule something by separate mail.


> For landing this in jetpack, do I check into the git repo
> or the hg repo?  Do I even have permissions to check in or do I need to get
> someone to land this for me?

You check into the git repo, and I can give you permissions; just let me know your GitHub username.  Alternately, I can land this for you if you don't have a GitHub account or would rather I do it.
Landed so that it makes it in before the freeze for SDK 1.0b4:

https://github.com/mozilla/addon-sdk/commit/dd71a866ca9fe55a89e348be7248a8e180dd524a
Assignee: nobody → ctalbert
Status: NEW → RESOLVED
Closed: 13 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Blocks: 648523