Closed Bug 712102 Opened 13 years ago Closed 13 years ago

Mozrunner hangs when registering addons for mozilla 1.9.2

Component: Testing :: Mozbase (defect)
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: mozilla, Unassigned
Attachments: 3 files

From bug 700415 comment 51:

On w7:

14:09:25     INFO -    File "C:\mozilla-build\python25\lib\tempfile.py", line 33, in <module>
14:09:25     INFO -      from random import Random as _Random
14:09:25     INFO -    File "C:\mozilla-build\python25\lib\random.py", line 838, in <module>
14:09:25     INFO -      _inst = Random()
14:09:25     INFO -    File "C:\mozilla-build\python25\lib\random.py", line 94, in __init__
14:09:25     INFO -      self.seed(x)
14:09:25     INFO -    File "C:\mozilla-build\python25\lib\random.py", line 108, in seed
14:09:25     INFO -      a = long(_hexlify(_urandom(16)), 16)
14:09:25     INFO -  WindowsError: [Error 22] Invalid Signature

(Random error when creating the mozprofile.)

This doesn't always happen.
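(For context: that traceback comes from the module-level seeding in Python 2.5's random module, reached via "import tempfile". The seeding path, roughly reconstructed below, only guards against NotImplementedError, so a WindowsError from os.urandom() escapes the import entirely. This is an approximation of the stdlib code, not a verbatim copy.)

# Rough reconstruction of the Python 2.5 seeding path (approximate, for
# illustration; the real code is in C:\mozilla-build\python25\lib\random.py).
import os
from binascii import hexlify as _hexlify

def _default_seed():
    try:
        return long(_hexlify(os.urandom(16)), 16)
    except NotImplementedError:
        # 2.5 only falls back to time-based seeding when os.urandom() is
        # unavailable; a WindowsError from the crypto provider is not caught
        # here and propagates out of "import tempfile".
        import time
        return long(time.time() * 256)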

From bug 700415 comment 53:

w7 and xp: I've stopped hitting the WindowsError and just end up hanging and timing out after 1200 seconds with no output during the peptest run.

The only way to rescue the slave after this hang is to reboot.
Depends on: 712072
I loaned ahal a Windows 7 box that I've been using for testing (talos-r3-w7-003).
So I managed to reproduce this on my Windows environment at home. Oddly, mcote was also able to reproduce it on his Linux environment (which used to work for him).

After doing a pdb trace, this is the line that is hanging in mozrunner: https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/runner.py#L164

That code is only there to maintain compatibility with Firefox 3.6.

Given that you used to be hitting the python random lib and then magically stopped seeing it (and that mcote and I are both now able to reproduce without having been able to in the past), my guess is that there was a change in Firefox.
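(For reference, the compatibility step at that line amounts to launching the binary once so Gecko 1.9.x registers newly installed extensions before the real run, then waiting for that launch to exit. A minimal sketch of that idea, assuming the legacy -silent startup flag; the names below are illustrative, not the actual mozrunner code:)

import subprocess

def register_addons_for_gecko_19(binary, profile_dir):
    # Hypothetical sketch: on Gecko 1.9.x (Firefox 3.6), extensions dropped
    # into a fresh profile are only picked up after an extra registration run.
    cmd = [binary, '-profile', profile_dir, '-silent']
    proc = subprocess.Popen(cmd)
    # If this registration run never exits (as newer Firefox builds appear to
    # do here), the caller blocks at this wait(), which matches the observed hang.
    proc.wait()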
So with some help from mcote we've determined:
a) It still works in Aurora (i.e. the regression happened in Firefox)
b) The problem goes away if we comment out the above line.

That code is only there to maintain compatibility with Firefox 3.6, so one possible solution is to drop support for 3.6.
There are only three consumers of mozrunner that run against Gecko 1.9 (Firefox 3.6): one is the Thunderbird tree (which I list because I'm not certain about it); the others, which I am more certain about, are the Jetpack cfx tool and the Mozmill testing system.

All three of these tools are running against the 1.5.x branch of the mozrunner API and will need to upgrade to modern versions of mozrunner at some point.  So, I'm OK with breaking 3.6 compatibility.  Rev the version number when you do, and make it very clear in your commit message what the breakage will be on 1.9 and that this is a deliberate move away from it.
Where is the stack trace in comment 0 coming from?
> w7 and xp: I've stopped hitting the WindowsError and just end up hanging and timing out after
> 1200 seconds with no output during the peptest run.

That's a separate issue that I haven't been able to reproduce.
(In reply to Ted Mielczarek [:ted, :luser] from comment #5)
> Where is the stack trace in comment 0 coming from?

Not entirely sure.
Catlee was worried about the new signing-on-demand, but I think I was able to reproduce with an Aurora build before those were signed on demand.

Then it went away :\

I guess phantom errors that are gone are better than phantom errors that stick around.

(In reply to Andrew Halberstadt [:ahal] from comment #6)
> > w7 and xp: I've stopped hitting the WindowsError and just end up hanging and timing out after
> > 1200 seconds with no output during the peptest run.
> 
> That's a separate issue that I haven't been able to reproduce.

The timeout is buildbot-specific; if there's no output for 20min (configurable, but this is probably symptomatic of some larger issue), it will kill the job.
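(For the record, that behaviour corresponds to buildbot's per-step no-output timeout. A hedged sketch of how such a step might be configured in a master.cfg; the step name and command here are illustrative:)

from buildbot.steps.shell import ShellCommand

# ShellCommand's `timeout` is the number of seconds a step may run without
# producing any output before buildbot kills the job.
peptest_step = ShellCommand(
    name='run_peptest',                    # illustrative
    command=['python', 'runpeptests.py'],  # illustrative
    timeout=1200,                          # 20 minutes of silence => killed
)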
Component: Peptest → Mozbase
OS: Windows 7 → All
QA Contact: peptest → mozbase
Hardware: x86 → All
Summary: Peptest doesn't install or hangs on Windows → Mozrunner hangs when registering addons for mozilla 1.9.2
I bumped the version to 5.1 and can upload to PyPI when it lands.
Attachment #583262 - Flags: review?(jhammel)
Comment on attachment 583262 [details] [diff] [review]
Patch 1.0 - Remove un-needed addon registration

this is kinda blind, but

remember to bump the mozmill master version requirement to reflect this
Attachment #583262 - Flags: review?(jhammel) → review+
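(That is, something along these lines in the consumer's setup.py; a hypothetical snippet, not the actual mozmill file:)

from setuptools import setup

setup(
    name='mozmill',
    # other metadata omitted
    install_requires=[
        'mozrunner >= 5.1',  # 5.1 drops the Gecko 1.9 addon-registration step
    ],
)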
Not sure if this is fixed for testing, but I'm still hanging on Windows using http://people.mozilla.com/~ahalberstadt/firefox-11.0a1.en-US.mac.tests.zip.
So the above patch landed and should be in the tests.zip https://github.com/mozilla/mozbase/commit/d09cb9d00db58f1f39d63185b278e99bf247ccdc. This patch fixes the hang I saw on Monday.

When mcote and I reproduced this, we saw the line "DEBUG Starting Peptest" before the hang actually occurred. I'm not seeing this in the test slave logs, so maybe this is a separate hang? Ay ca-rumba.
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> So the above patch landed and should be in the tests.zip
> https://github.com/mozilla/mozbase/commit/
> d09cb9d00db58f1f39d63185b278e99bf247ccdc. This patch fixes the hang I saw on
> Monday.
> 
> When mcote and I reproduced this we saw the line "DEBUG Starting Peptest"
> before the hang actually occured. I'm not seeing this in the test slave
> logs. So maybe this is a separate hang? Ay ca-rumba.

It's probably worth noting this change in the mozrunner README.md.
C:\Documents and Settings\cltbld>c:\\talos-slave\\test\\build\\venv\\Scripts\\python c:\\talos-slave\\test\\build\\tests\\peptest\\peptest\\runpeptests.py --binary c:\\talos-slave\\test\\build\\application\\firefox\\firefox.exe --test-path c:\\talos-slave\\test\\build\\tests\\peptest\\tests/firefox/firefox_all.ini
=ENTERING MAIN=                                                    
=ENTERING PEPTEST INIT=
creating runner                                                                 
=EXITING PEPTEST INIT=                                                          
=STARTING PEPTEST=                                                             
=ENTERING MOZRUNNER START=                                                 
=ENTERING MOZRUNNER KILL=
running process        
ProcessManager UNABLE to use job objects to manage child processes
ProcessManager NOT managing child processes
PEP ERROR | AttributeError: 'Process' object has no attribute '_procmgrthread'
Exception exceptions.AttributeError: "'Process' object has no attribute '_internal_poll'" in <bound method Process.__del__ of <mozprocess.processhandler.Process object at 0x011A4B70>> ignored
=ENTERING MOZRUNNER KILL=
Exception exceptions.AttributeError: "'PepProcess' object has no attribute 'proc'" in <bound method FirefoxRunner.cleanup of <mozrunner.runner.FirefoxRunner object at 0x011A4A90>> ignored
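(The two "ignored" exceptions above are cleanup paths assuming attributes that were never created because process-manager setup bailed out early. A hedged sketch of the kind of defensive cleanup that avoids this; the attribute names mirror the log, but the code is illustrative:)

class RunnerCleanupSketch(object):
    # Illustrative only: guard shutdown paths so a partially-initialized
    # runner/process does not raise AttributeError from __del__ or cleanup().

    def cleanup(self):
        proc = getattr(self, 'proc', None)            # may never have been set
        if proc is not None and proc.poll() is None:
            proc.kill()

    def _join_procmgr(self):
        thread = getattr(self, '_procmgrthread', None)
        if thread is not None:                        # only join if it was started
            thread.join()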
So I changed the mozharness_python to use -u, and that didn't help the buffering issue.

When I tried running the above command (comment 13) on the command line, that's the output I got.  I'm not sure why I wasn't able to get that from mozharness; it's entirely possible that's a bug, though I wasn't able to figure it out.

Hoping that output is useful?

Ahal has the login info for talos-r3-w7-003 via email, which should allow him to do some testing via VNC or SSH.

If it's still not solved by EOW, maybe we need someone else to get access to buildvpn to try.
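(For reference, the -u change above just asks the interpreter for unbuffered stdout/stderr; a hypothetical sketch of invoking the test script that way, with illustrative arguments:)

import subprocess
import sys

# Run the peptest script with unbuffered output (-u) so the harness/buildbot
# sees log lines as they are produced rather than after a flush.
cmd = [sys.executable, '-u', 'runpeptests.py',
       '--binary', 'firefox.exe',            # illustrative
       '--test-path', 'firefox_all.ini']     # illustrative
subprocess.call(cmd)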
Attached patch: WIP fix (Splinter Review)
So this patch to mozprocess fixes the hang for me. Two things to note:
1) There's still an exception (AttributeError) on shutdown when calling self._job.Close()
2) I'm not sure if this patch introduces memory leaks with the python ctypes stuff

I still don't really know what is going on (I just made this patch based on the stack traces I got). Clint did a whole bunch of debugging; I feel like he can probably shed some more light on what's happening than I can.
So, there were a host of issues here.
1. We weren't in great shape on python 2.5.  We were calling some things that don't exist in 2.5, and that was causing us problems.
2. There is an oddity with None != NULL in python ctypes on python 2.5 (I assume), so we could not create the job object. The code is written to fall back gracefully when it cannot create job objects, but because we threw an exception early due to the 2.5 issue, we never completely fell back as originally intended. That left us in an in-between state, half expecting job objects to work and half expecting them not to, which caused the issues.

The changes we've made address these things.
Attachment #583949 - Flags: review?(ahalberstadt)
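(A minimal sketch of the all-or-nothing fallback described above; not the actual mozprocess patch, and the class and attribute names are illustrative:)

import sys
import ctypes

class ProcessManagerSketch(object):
    # Illustrative fallback logic, not the real mozprocess code.

    def __init__(self):
        self._job = None
        self._use_job_objects = False
        if sys.platform == 'win32':
            try:
                kernel32 = ctypes.windll.kernel32
                # NULL security attributes, unnamed job object
                job = kernel32.CreateJobObjectW(None, None)
                if job:  # a 0/NULL handle means the call failed
                    self._job = job
                    self._use_job_objects = True
            except Exception:
                # Any failure must leave us fully in "not managing child
                # processes" mode, never the half-initialized state that
                # produced the _procmgrthread AttributeError earlier.
                self._job = None
                self._use_job_objects = False

    def cleanup(self):
        if self._use_job_objects and self._job:
            ctypes.windll.kernel32.CloseHandle(self._job)
            self._job = None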
And I don't get any errors running peptest on Windows with python 2.6.5 and this patch.
Comment on attachment 583949 [details] [diff] [review]
this works on windows on the buildbot slave!

Review of attachment 583949 [details] [diff] [review]:
-----------------------------------------------------------------

Awesome! Thanks so much for helping debug this.
So to recap, the next steps are: land this in master, mirror master to m-c, test and clean up the mozharness code, get everything reviewed, and deploy to try.
Attachment #583949 - Flags: review?(ahalberstadt) → review+
(In reply to Andrew Halberstadt [:ahal] from comment #18)
> Comment on attachment 583949 [details] [diff] [review]
> this works on windows on the buildbot slave!
> 
> Review of attachment 583949 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Awesome! Thanks so much for helping debug this.
> So to recap next steps we need to land this in master, mirror master to m-c,
> test and clean up the mozharness code, get everything reviewed, deploy to
> try.

Please go ahead and land on master, then give me a patch to mirror to m-c and I can review and land that this weekend.  Thanks for all the hard work on this, and have a wonderful Christmas and an excellent new year.

-- C
master: https://github.com/mozilla/mozbase/commit/df008c27b6a83c8ccd04302f3fa0471b12254c89

I'll file a new bug for getting this into m-c since there are a whole bunch of patches that need to get in.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED