Closed Bug 956315 Opened 10 years ago Closed 10 years ago

JSBridge server cannot be started after a restart of Firefox because it is still in use

Categories

(Testing Graveyard :: Mozmill, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [mozmill-2.0.4+])

Attachments

(3 files, 3 obsolete files)

Since we upgraded our CI to Mozmill 2.0 we have seen a lot of disconnect errors due to a broken JSBridge connection. I tried to reproduce such a condition and was successful on OS X. Here are the last lines of the debug log:

* Sending message: '{"eventType":"mozmill.pass","result":{"function":"assert.waitFor()"}}'
* Set: '6dd53978-7489-11e3-bf4e-3c07540e854b'
* Sending message: '{"result":true,"data":"bridge.registry[\"{27c051a3-c4a7-1446-9ea5-2201b22a307e}\"]","uuid":"6dd53978-7489-11e3-bf4e-3c07540e854b"}'
* Observer topic: 'quit-application'
* Stop JSBridge server on port: '60022'
* Observer topic: 'profile-after-change'
TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
RESULTS | Passed: 2
RESULTS | Failed: 0
RESULTS | Skipped: 3
Traceback (most recent call last):
  File "/Users/mozauto/jenkins/workspace/project_functional/mozmill-env-mac/python-lib/mozmill/__init__.py", line 782, in run
    mozmill.run(tests, self.options.restart)
  File "/Users/mozauto/jenkins/workspace/project_functional/mozmill-env-mac/python-lib/mozmill/__init__.py", line 409, in run
    frame = self.run_test_file(frame or self.start_runner(),
  File "/Users/mozauto/jenkins/workspace/project_functional/mozmill-env-mac/python-lib/mozmill/__init__.py", line 326, in start_runner
    self.create_network()
  File "/Users/mozauto/jenkins/workspace/project_functional/mozmill-env-mac/python-lib/mozmill/__init__.py", line 287, in create_network
    self.jsbridge_port)
  File "/Users/mozauto/jenkins/workspace/project_functional/mozmill-env-mac/python-lib/jsbridge/__init__.py", line 44, in wait_and_create_network
    raise Exception("Cannot connect to jsbridge extension, port %s" % port)
Exception: Cannot connect to jsbridge extension, port 60022
It looks like I can reproduce it far more reliably when running Mozmill in --debug mode. What the lines above show me is that jsbridge somehow doesn't correctly set up the bridge after the 'profile-after-change' observer topic. Usually we should see lines like these:

* Observer topic: 'quit-application'
* Stop JSBridge server on port: '60742'
* Observer topic: 'profile-after-change'
* Start JSBridge server on port: '60742'
* Register: '8286bae1-748c-11e3-8889-3c07540e854b (backchannel)'
This bug will require us to ship a Mozmill 2.0.4 release.
Whiteboard: [mozmill-2.0.4+]
Given that we fail during startup I tried to get the browser console log, which hopefully could show us what's going on here. It took a while to get this problem reproduced, but here we have it:

Could not read chrome manifest 'file:///Users/mozauto/jenkins/workspace/project_functional/FirefoxHolly.app/Contents/MacOS/browser/extensions/%7B972ce4c6-7e08-4474-a285-3208198ce6fd%7D/chrome.manifest'.

While creating services from category 'profile-after-change', could not create service for entry 'Disk Space Watcher Service', contract ID '@mozilla.org/toolkit/disk-space-watcher;1'
[Exception... "'Error: Socket failed to bind, kill all firefox processes' when calling method: [nsIObserver::observe]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: no]
missing : after property id handlers.js:61

While creating services from category 'profile-after-change', could not create service for entry 'MozmillHandlers', contract ID '@mozilla.org/mozmill/handlers;1'

It looks like something is wrong with creating any kind of service for the 'profile-after-change' topic. Sadly we do not get any further details for our case, but the code for the 'Disk Space Watcher Service' might be blocking us here. Interestingly there is no line for our jsbridge handler. I will check once more whether that was my fault in copy and paste.
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Could not read chrome manifest
> 'file:///Users/mozauto/jenkins/workspace/project_functional/FirefoxHolly.app/
> Contents/MacOS/browser/extensions/%7B972ce4c6-7e08-4474-a285-3208198ce6fd%7D/
> chrome.manifest'.

This is irrelevant. It's the default theme.

> While creating services from category 'profile-after-change', could not
> create service for entry 'Disk Space Watcher Service', contract ID
> '@mozilla.org/toolkit/disk-space-watcher;1'

I can also see this outside of Mozmill and it doesn't cause issues, so I assume this is something different as well.

> [Exception... "'Error: Socket failed to bind, kill all firefox processes'
> when calling method: [nsIObserver::observe]"  nsresult: "0x8057001c
> (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown
> filename> :: <TOP_LEVEL> :: line 0"  data: no]
> missing : after property id handlers.js:61
> 
> While creating services from category 'profile-after-change', could not
> create service for entry 'MozmillHandlers', contract ID
> '@mozilla.org/mozmill/handlers;1'

Interestingly, here we have our problem. The exception shown above is not related to the disk space watcher but to Mozmill, which registers its component via handlers.js. The line that is failing here is:

var errorRegEx = /\[.*(Error|Exception).*(chrome|resource):\/\/(mozmill|jsbridge).*/i;

I'm absolutely not sure what the problem is here, or why the parser expects a property at this point.
Depends on: 956426
Well, the above failure is a different issue, given that the jsbridge component is independent of mozmill and should be registered even if mozmill fails. So I will fix that in bug 956426, which is necessary to expose the real underlying issue here.

So the real jsbridge issue here is: 'Error: Socket failed to bind, kill all firefox processes'. I will have to figure out why it is happening, but it could be somewhat related to mozprocess returning too early as reported on bug 947248.
This failure can already happen after the first test within a restart test, as seen here:

TEST-START | testAddons_enableDisableExtension/test1.js | setupModule
TEST-START | testAddons_enableDisableExtension/test1.js | testInstallAddon
TEST-PASS | testAddons_enableDisableExtension/test1.js | testInstallAddon
TEST-START | testAddons_enableDisableExtension/test1.js | teardownModule
TEST-END | testAddons_enableDisableExtension/test1.js | finished in 2772ms
TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
Summary: Exception: Cannot connect to jsbridge extension, port XYZ → Exception: Cannot connect to jsbridge extension, port XYZ due to 'Error: Socket failed to bind, kill all firefox processes'
So I think the problem here is that we do not correctly wait for the former bridge connection to be closed. This is necessary because with a restart of Firefox the application keeps its process id. Calling self.runner.wait() would time out, so we might have to explicitly wait for self.bridge.connected to become false, or force a disconnect (see the sketch after this comment).

If the above is the case I will have to check which of those two methods is more appropriate.

This bug might also be the reason why bug 915563 exists, so I will mark it as blocking.
Blocks: 915563
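To illustrate the two options mentioned above, here is a minimal Python sketch, assuming a jsbridge client object that exposes the connected flag and a close() method referenced in that comment; the helper name, timeout and polling interval are purely illustrative:

import time

def wait_for_bridge_closed(bridge, timeout=10.0, interval=0.1):
    """Option 1: poll until the former bridge connection reports itself
    as closed; option 2 (fallback): force the disconnect ourselves."""
    end_time = time.time() + timeout
    while time.time() < end_time:
        if not bridge.connected:
            return True
        time.sleep(interval)

    # The server never dropped the connection in time, so close the client side.
    bridge.close()
    return not bridge.connected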
I was wrong with my assumption. In cases of this failure the bridge connection is actually closed. Further investigation has shown that the problem is located here:

https://github.com/mozilla/mozmill/blob/master/jsbridge/jsbridge/extension/resource/modules/Sockets.jsm#L121

The hang we see here is caused by the NSS.Sockets.PR_Bind() call (an NSPR function exposed through NSS.jsm), which doesn't return:
https://github.com/mozilla/mozmill/blob/master/jsbridge/jsbridge/extension/resource/modules/NSS.jsm#L84

So for now it could be a change in NSS itself, or in ctypes for Firefox 28. 27 beta builds are not affected by this, as tests have shown. Given that the issue is not reliably reproducible, it's hard to track down.

I'm cc'ing some folks with NSS and ctypes knowledge who might be able to help us.
Interestingly on Nov 22nd we upgraded NSPR to version 4.10.3beta1 on bug 942207. This might have caused this problem. I will run some tests around this date.
It's unlikely that the above bug is the reason for this issue. The regression on the holly branch is between Dec 4th and Dec 5th: http://hg.mozilla.org/projects/holly/pushloghtml?fromchange=347eaf441e0b&tochange=aca9247177d8

I also tried to run those tests against tinderbox builds, but for some reason I don't see the failure for those builds. The next step will be to run tests against Nightly builds and check whether the regression range matches.
Brian, is there a way to use the NSPR logger to get more details on why we are hanging here? Or any other kind of logger?
Flags: needinfo?(brian)
There were only three changes in that NSPR update, none of them look very likely to be the root cause here, but you could certainly test.
Strangely, for Nightly builds the regression range seems to be between Dec 18th and 19th. Not sure if tinderbox builds will help here but I will try.

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=862cb6a1cc88&tochange=5c7fa2bfea8b
I was able to reproduce the failure with a tinderbox build, so after thousands of test runs the regression range for Nightly is somewhere between those two changesets:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=940c52e93281&tochange=04a70c8908de

Nothing in here really stands out. :(

I think I will have to find a developer box and get gdb to retrieve the stack for such a failure, unless someone can help out in how to log NSPR code.
(In reply to Henrik Skupin (:whimboo) from comment #14)
> I think I will have to find a developer box and get gdb to retrieve the
> stack for such a failure, unless someone can help out in how to log NSPR
> code.

Jason, do you know?
Flags: needinfo?(jduell.mcbugs)
I assume the Socket Thread is the right one to look at. Here is the gdb output of an opt build which I was lucky enough to catch for now.

(gdb) info threads
  Id   Target Id         Frame 
  33   Thread 0x7fa2efe06700 (LWP 32246) "Gecko_IOThread" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  32   Thread 0x7fa2e86ff700 (LWP 32247) "Socket Thread" 0x00007fa2fc52af7d in poll () at ../sysdeps/unix/syscall-template.S:81
(gdb) thread 32
[Switching to thread 32 (Thread 0x7fa2e86ff700 (LWP 32247))]
#0  0x00007fa2fc52af7d in poll () at ../sysdeps/unix/syscall-template.S:81
81	in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007fa2fc52af7d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fa2fc0ec198 in _pr_poll_with_poll () from /mozilla/bin/nightly/libnspr4.so
#2  0x00007fa2f78c8338 in nsSocketTransportService::Poll(bool, unsigned int*) () from /mozilla/bin/nightly/libxul.so
#3  0x00007fa2f78c852a in nsSocketTransportService::DoPollIteration(bool) () from /mozilla/bin/nightly/libxul.so
#4  0x00007fa2f78c897d in nsSocketTransportService::Run() () from /mozilla/bin/nightly/libxul.so
#5  0x00007fa2f788c5e5 in nsThread::ProcessNextEvent(bool, bool*) () from /mozilla/bin/nightly/libxul.so
#6  0x00007fa2f783a02e in NS_ProcessNextEvent(nsIThread*, bool) () from /mozilla/bin/nightly/libxul.so
#7  0x00007fa2f7a572be in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) () from /mozilla/bin/nightly/libxul.so
#8  0x00007fa2f7a4ac14 in MessageLoop::Run() () from /mozilla/bin/nightly/libxul.so
#9  0x00007fa2f788e2b8 in nsThread::ThreadFunc(void*) () from /mozilla/bin/nightly/libxul.so
#10 0x00007fa2fc0ed6a6 in _pt_root () from /mozilla/bin/nightly/libnspr4.so
#11 0x00007fa2fd436f6e in start_thread (arg=0x7fa2e86ff700) at pthread_create.c:311
#12 0x00007fa2fc5379cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

So it looks like we are hanging somewhere in the following method?

http://hg.mozilla.org/mozilla-central/annotate/cf2d1bd796ea/nsprpub/pr/src/pthreads/ptio.c#l3757
I have been running the tests for a couple of hours now with debug tinderbox builds, but none of them on OS X or Linux has shown this problem. It seems to happen only for opt builds.

I think that I will file the appropriate NSPR bug tomorrow given that I'm fairly sure it is not our fault.
I was wrong about debug builds on OS X. I have seen a hang there but didn't notice that it was actually this problem. It is always reproducible after the first restart of Firefox. The exception which gets thrown is:

* Call to xpconnect wrapped JSObject produced this error:  *
[Exception... "'Error: Socket failed to bind, kill all firefox processes' when calling method: [nsIObserver::observe]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: no]
Running debug builds on OSX I have reproduced this failure with a 27 debug build (Oct 2013), a 24 debug build (June 2013) and a 20 debug build (January 2013).

We should check if it is a mozmill regression.
So on OS X PR_Bind returns and doesn't hang infinitely. There I get a return value of -5982, which means:

/* Local Network address is in use */
#define PR_ADDRESS_IN_USE_ERROR                  (-5982L)

That means the socket created before the restart is still in use and hasn't been closed correctly. So we have two different things to investigate and fix:

1. Fix Mozmill to correctly close the socket
2. Fix Firefox to return from PR_Bind() if the requested socket is in use
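As a rough analogy in plain Python sockets (the port number is just the one from the log above; errno.EADDRINUSE is the POSIX counterpart of PR_ADDRESS_IN_USE_ERROR), this is the condition we run into when the old server socket was never released:

import errno
import socket

PORT = 60022  # jsbridge port from the log above; any free port works

# Simulates the socket left behind by the previous Firefox instance.
old_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old_server.bind(("127.0.0.1", PORT))
old_server.listen(1)

# Simulates the restarted instance trying to bind the same port again.
new_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    new_server.bind(("127.0.0.1", PORT))
except socket.error as e:
    if e.errno == errno.EADDRINUSE:
        print("bind failed, address in use (the -5982 case)")
finally:
    new_server.close()
    old_server.close()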
Weird that PR_Bind doesn't return if the socket is in use.

I've never actually logged NSPR's own code before, but it looks possible: try setting 

  NSPR_LOG_MODULES=timestamp,io:5,nsSocketTransport:5

  http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/misc/prinit.c#155

The 'io' log module is used by some of the NSPR socket functions (but note that SocketBind() doesn't have any PR_LOG statements in it).
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(brian)
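As a minimal sketch of how one might launch the browser with that logging enabled from a Python harness (the binary and profile paths are placeholders, and NSPR_LOG_FILE is assumed to be the companion variable that redirects the log output to a file):

import os
import subprocess

env = os.environ.copy()
env["NSPR_LOG_MODULES"] = "timestamp,io:5,nsSocketTransport:5"
env["NSPR_LOG_FILE"] = "/tmp/nspr.log"   # placeholder path

# Placeholder binary and profile paths.
subprocess.call(["/path/to/firefox", "-profile", "/path/to/testprofile"],
                env=env)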
Attached file nspr.log
So this is the log file generated with the params above. Not sure if you can read anything out of it yet, but I will continue later to get more details. This log is from OS X, where we do not hang in PR_Bind() and where I can constantly reproduce the issue with debug builds. I might try to add some log statements to PR_Bind later.
Flags: needinfo?(jduell.mcbugs)
Interestingly, what is missing after the restart is the line 'creating nsSocketTransport'. It looks like we don't create the transport at all. Does that depend on something? How does Firefox behave when you restart the application from within Firefox? Could something survive the restart?
Attachment #8358255 - Attachment mime type: text/x-log → text/plain
Ok, I think I know now what's going on here. Actually it doesn't seem to be an NSPR bug. Instead something prevents Firefox from completely shutting down and a zombie process remains:

501 89263 89255   0  2:27PM ttys000    0:00.00 (firefox)
501 89270     1   0  2:27PM ttys000    0:00.02 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox
 
I will have to take a look at this process. Hopefully I can find something which will help us further. At least that explains why the socket at the given port cannot be used again.
Flags: needinfo?(jduell.mcbugs)
Attached patch WIP (obsolete) — Splinter Review
Ok, so the problem is indeed in Mozmill. The client socket connections we set up in Python to connect to the socket server spawned in Gecko are causing this behavior. During a restart the clients get disconnected from the server, but somehow they still keep the socket open. That prevents the next Firefox instance from accessing it.

As a solution we could close the connections earlier, so we can make sure the Firefox process shuts down sanely. I have a WIP here which touches core shutdown code and is kind of risky; a final patch would need really deep testing.

For now I would like to know if that patch works. Andrei, can you please have a look at it? Thanks
Attachment #8358372 - Flags: feedback?(andrei.eftimie)
Attached patch WIP 1.1 (obsolete) — Splinter Review
A better approach which only calls close() on both clients for a restart, but not for a normal shutdown. It also applies cleanly now.
Attachment #8358372 - Attachment is obsolete: true
Attachment #8358372 - Flags: feedback?(andrei.eftimie)
Attachment #8358376 - Flags: feedback?(andrei.eftimie)
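A minimal sketch of the idea behind the two WIP patches above, assuming the two Python-side jsbridge clients are reachable as attributes on the Mozmill harness; names and structure are illustrative, not the actual patch:

def disconnect_bridge_clients(mozmill, restart=False):
    """Close the Python-side jsbridge clients before Firefox goes away,
    so the OS releases the socket and the next instance can bind again.
    Per WIP 1.1 this is only done for a restart, not a normal shutdown."""
    if not restart:
        return
    for client in (mozmill.back_channel, mozmill.bridge):
        try:
            client.close()
        except Exception:
            # The connection may already be gone once the server shut down.
            pass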
Comment on attachment 8358376 [details] [diff] [review]
WIP 1.1

Review of attachment 8358376 [details] [diff] [review]:
-----------------------------------------------------------------

This fixes the hang on restartApplication(), but I see it every time we call stopApplication().
If nothing else follows the stopApplication call this finishes nicely: https://pastebin.mozilla.org/3982110
But when we attempt to restart Firefox it hangs the same way: https://pastebin.mozilla.org/3982109

Besides this I've also noticed
> Socket Error: [Errno 54] ECONNRESET
intermittently.

We're probably on the right track with fixing this but still have some issues.
Attachment #8358376 - Flags: feedback?(andrei.eftimie) → feedback-
Jason, could you please still comment on comment 23? I'm not sure why it works when we close the connection from our clients earlier, instead of letting the server do this on shutdown. The latter doesn't work.
Flags: needinfo?(jduell.mcbugs)
This is nothing which will fix the issue by itself, but it helps us a lot with the investigation.
Attachment #8358507 - Flags: review?(ctalbert)
Attachment #8358507 - Flags: review?(ctalbert) → review+
I talked with Clint on Friday about this problem and he agreed with my proposed steps, which I will work on next:

* Using Wireshark to observe the creation and destruction of sockets and connections
* Using a process monitor to observe why we are getting zombie processes during restarts of Firefox

If none of the above gives us more insights I will follow the proposed WIP by closing connections earlier.
Depends on: 959551
Just a quick update regarding steps to reproduce:

1. Clone the mozmill repo from https://github.com/mozilla/mozmill
2. Create a new venv and run ./setup_development.sh
3. Run "mozmill -b FirefoxNightlyDebug.app/ -m mutt/mutt/tests/js/testFrame/testRestart/testApplicationDisconnect/manifest.ini"

After the first restart Firefox complains that it cannot bind to the socket port because it is still in use.
Blocks: 960495
With my patch on bug 959551 applied it is fairly easy to solve this bug. The underlying problem really is those zombie processes which keep the socket open. Once they are all gone, retrying a couple of times lets us reconnect to the socket (see the sketch below).

A patch for this bug will be up by tomorrow.
Flags: needinfo?(jduell.mcbugs)
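For reference, a minimal sketch of the retry idea, using only the standard socket module; the helper name, retry count and delay are illustrative and not taken from the actual patch:

import socket
import time

def connect_with_retries(host, port, attempts=10, delay=1.0):
    """Keep trying to reach the jsbridge server; early attempts may still
    fail while a zombie Firefox process holds on to the port."""
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=5.0)
        except socket.error:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)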
Attached patch Patch v1 (obsolete) — Splinter Review
With those changes we should never experience the socket issues again.
Attachment #8358376 - Attachment is obsolete: true
Attachment #8364268 - Flags: feedback?(cosmin.malutan)
Comment on attachment 8364268 [details] [diff] [review]
Patch v1

Thanks. I applied both patches and didn't get the socket error.
Attachment #8364268 - Flags: feedback?(cosmin.malutan) → feedback+
Comment on attachment 8364268 [details] [diff] [review]
Patch v1

Review of attachment 8364268 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Cosmin. It's good to hear that all is working well now. Asking Dave for review of this patch.
Attachment #8364268 - Flags: review?(dave.hunt)
Summary: Exception: Cannot connect to jsbridge extension, port XYZ due to 'Error: Socket failed to bind, kill all firefox processes' → JSBridge server cannot be started after a restart of Firefox because it is still in use
Attached patch Patch v1.1Splinter Review
Updated patch with a log message when run in --debug mode.
Attachment #8364268 - Attachment is obsolete: true
Attachment #8364268 - Flags: review?(dave.hunt)
Attachment #8364925 - Flags: review?(dave.hunt)
Comment on attachment 8364925 [details] [diff] [review]
Patch v1.1

Review of attachment 8364925 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks Henrik!
Attachment #8364925 - Flags: review?(dave.hunt) → review+
https://github.com/mozilla/mozmill/commit/c82d2d2948ee957e1cc516e1407c92594b4e5283 (master)
https://github.com/mozilla/mozmill/commit/3698bacc8a5f7613ee50f70193fa0f79a003cf2f (hotfix-2.0)

The patches have landed and the problem should be fixed for Mozmill. I will observe my Linux box over the next few days and run a couple of debug builds to check whether we still see the hang in PR_Bind. If that is the case I will file an NSPR bug.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Product: Testing → Testing Graveyard