Closed Bug 643697 Opened 13 years ago Closed 13 years ago

JSBridge limits amount of data transferred between the application and CLI and causes disconnects in teardownModule

Categories

(Testing :: Mozbase, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: cmtalbert)

References

Details

(Whiteboard: [mozmill-2.0+][mozmill-1.5.3+][verified-1.5.3])

Attachments

(2 files, 2 obsolete files)

In the last couple of days we have noticed some application disconnects in the teardownModule method of our tests, even for those tests which have been disabled or tests without a teardownModule. Those disconnects are breaking our test-run. Not sure yet if that is a regression from Mozmill 1.5.1 or not.

As Jeff and Andrew told this is caused by large blocks of data send over the JSBridge connection. Andrew found that by sending screenshots encoded as base64. I for myself have discovered that as cause for bug 643491 yesterday. And we really don't send that much data. It's only some information about the update patches including URLs. Surprisingly it looks like there is a difference between the platforms because bug 643491 only happens on Windows 2000.

This is kinda a blocker for us because we need that additional information to get sent for investigations if an update fails. That means it would be very appreciated if we could fix that real quick and even for the Mozmill 1.5.x hotfix branch.
Just as an example. The following data we assign to the persisted object is enough to break the update tests on Windows 2000:

  "updates": [
    {
      "fallback": false,
      "build_pre": {
        "disabled_addons": "",
        "buildid": "20110320030417",
        "locale": "en-US",
        "version": "4.0b13pre",
        "user_agent": "Mozilla/5.0 (Windows NT 6.0; rv:2.0b13pre) Gecko/20110320 Firefox/4.0b13pre",
        "url_aus": "https://aus3.mozilla.org/update/3/Firefox/4.0b13pre/20110320030417/WINNT_x86-msvc/en-US/nightly/Windows_NT%206.0/default/default/update.xml?force=1"
      },
      "build_post": {
        "disabled_addons": "",
        "buildid": "20110321030419",
        "locale": "en-US",
        "version": "4.0b13pre",
        "user_agent": "Mozilla/5.0 (Windows NT 6.0; rv:2.0b13pre) Gecko/20110321 Firefox/4.0b13pre",
        "url_aus": "https://aus3.mozilla.org/update/3/Firefox/4.0b13pre/20110321030419/WINNT_x86-msvc/en-US/nightly/Windows_NT%206.0/default/default/update.xml?force=1"
      },
      "success": true,
      "patch": {
        "url_mirror": "http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011/03/2011-03-21-03-mozilla-central/firefox-4.0b13pre.en-US.win32.partial.20110320030417-20110321030419.mar",
        "buildid": "20110321030419",
        "download_duration": 800,
        "type": "minor",
        "is_complete": false,
        "channel": "nightly",
        "size": 1754368
      }
    }
I'd just like to point out that we don't know for a fact that it is the amount of data that causes this problem.  It is likely, true, but in fact we don't know exactly what the problem is
In the future, please file bugs with adequate steps to reproduce.  I wasted about an hour of Henrik's time this morning figuring out how to repro this, and since I can't seem to come up with a simple test that recreates the issue, let me document how to run all this.
I assume you're working from your user directory in msys on windows (repro'd on win7 64 bit).
= Setup =
1. hg clone http://hg.mozilla.org/qa/mozmill-tests
2. cd mozmill-tests;hg up -C mozmill2.0; cd ..
3. hg clone http://hg.mozilla.org/qa/mozmill-automation
4. Ensure your mozmill branch is pointing at hotfix-1.5.2, and you have python setup.py develop'd mozrunner, jsbridge, and mozmill
5. cd mozmill-tests; hg qnew myhappytest
6. Download Firefox 4 RC1: http://stage.mozilla.org/pub/mozilla.org/firefox/releases/4.0rc1/win32/en-US/

= Running the test =
1. cd mozmill-automation
2. python testrun_update.py --repository=../mozmill-tests <path_to_downloaded_firefox4_rc1_install_file>

= Expected =
Without any changes to the codebase, the teardown will fail due to a jsbridge disconnect error. Which is thrown from jsbridge/network.py:200

= Reverting Henrik's latest change that exposed this issue =
1. Edit mozmill-tests/lib/software-update.js:177 from url_aus : '<whatever>' to something like url_aus : 'xyz';
2. cd mozmill-tests; hg qrefresh -m "tests"; cd ..
3. Follow the steps in "Running the test" to run it again.

= Expected =
If you revert the url_aus line, then you will not see the socket.error thrown during the teardown module.
Attached patch Patch against 1.5.2 branch (obsolete) — Splinter Review
heather, take a look at this and let me know what you think of the approach.  I left all my debug logging in (note, only works on windows due to the logging file path in jsbridge) so you could easily verify it.

Essentially I am handling the case where jsbridge is writing something or reading something larger than its default buffer size.
Assignee: nobody → ctalbert
Status: NEW → ASSIGNED
Attachment #521309 - Flags: feedback?(fayearthur+bugs)
Attachment #521309 - Attachment is obsolete: true
Attachment #521309 - Flags: feedback?(fayearthur+bugs)
Attachment #521692 - Flags: review?(fayearthur+bugs)
Heather, it would be great when we could get this solved soon. We have more and more of those failures. Especially for our endurance tests.
Blocks: 645227
It looks like that this patch doesn't resolve the problem completely. Over on bug 645227 we have experienced a disconnect in teardownModule even for our endurance tests. This guy has applied this patch and the disconnect still happens:

TEST-START | /tmp/tmpJPwZ1l.mozmill-tests/tests/endurance/testTabView/testOpenNewTab.js | teardownModule
Timeout: bridge.execFunction("f1da7462-574b-11e0-a5da-00216321c4d1", bridge.registry["{38a49482-b69c-40e5-93a5-f2e30850dfc5}"]["runTestFile"], ["/tmp/tmpJPwZ1l.mozmill-tests/tests/endurance/testTabView/testOpenNewTab.js"])

Looks like we have a different issue existent though.
(In reply to comment #8)
> It looks like that this patch doesn't resolve the problem completely. Over on
> bug 645227 we have experienced a disconnect in teardownModule even for our
> endurance tests. This guy has applied this patch and the disconnect still
> happens:
> 
> TEST-START |
> /tmp/tmpJPwZ1l.mozmill-tests/tests/endurance/testTabView/testOpenNewTab.js |
> teardownModule
> Timeout: bridge.execFunction("f1da7462-574b-11e0-a5da-00216321c4d1",
> bridge.registry["{38a49482-b69c-40e5-93a5-f2e30850dfc5}"]["runTestFile"],
> ["/tmp/tmpJPwZ1l.mozmill-tests/tests/endurance/testTabView/testOpenNewTab.js"])
> 
> Looks like we have a different issue existent though.
That's not the same issue.  You are hitting the timeout not the "connected disconnected" error.  Since you're doing "endurance" tests, could you be actually going over the hard per test timeout in jsbridge?
(In reply to comment #9)
> > Looks like we have a different issue existent though.
> That's not the same issue.  You are hitting the timeout not the "connected
> disconnected" error.  Since you're doing "endurance" tests, could you be
> actually going over the hard per test timeout in jsbridge?

Oh you are probably right. I haven't seen this tiny difference in the failure message. I will investigate with Dave over on bug 645227.
No longer blocks: 645227
Comment on attachment 521692 [details] [diff] [review]
1.5.2 fix - final

This looks good, I haven't been able to reproduce this reliably but seems fine as long as its tested, just a couple nit picks:

>+const DEBUG_ON = 0;

this ain't C, there's a 'true' in js (;


> Session.prototype.onQuit = function() {
>+  closefile();

I can't seem to file this closefile() function defined anywhere?
Attachment #521692 - Flags: review?(fayearthur+bugs) → review+
Setting approval for 1.5.3 as discussed during this weeks meeting.
Whiteboard: [mozmill-2.0?][mozmill-1.5.x?] → [mozmill-2.0?][mozmill-1.5.3+]
Pushed to hotfix-1.5: https://github.com/mozautomation/mozmill/commit/48850e1ca7b033572e5143ac892ea8fe212683b3
Whiteboard: [mozmill-2.0?][mozmill-1.5.3+] → [mozmill-2.0+][mozmill-1.5.3+][fixed-1.5.3]
Works like a charm for 1.5.3.
Whiteboard: [mozmill-2.0+][mozmill-1.5.3+][fixed-1.5.3] → [mozmill-2.0+][mozmill-1.5.3+][verified-1.5.3]
This patch fixes 2.0 by extending the timeout in sending over the socket on the js side. It was timing out before and that's why it wasn't sending all the data.
Attachment #523148 - Attachment is obsolete: true
Attachment #526437 - Flags: review?(ctalbert)
Comment on attachment 526437 [details] [diff] [review]
Use the max timeout value for sending over NSPR socket

You forgot to include the jsbridgefilelogger.js file in your patch.

Otherwise though it looks good.  Please add in a new patch with that file added.

The screenshot tests (from Andrew's screen shot bug) are more than ample testing for this feature.
Attachment #526437 - Flags: review?(ctalbert) → review-
(In reply to comment #17)
> Comment on attachment 526437 [details] [diff] [review]
> Use the max timeout value for sending over NSPR socket
> 
> You forgot to include the jsbridgefilelogger.js file in your patch.

Where is this file, it wasn't in the 1.5 commit?
(In reply to comment #18)
> (In reply to comment #17)
> > Comment on attachment 526437 [details] [diff] [review]
> > Use the max timeout value for sending over NSPR socket
> > 
> > You forgot to include the jsbridgefilelogger.js file in your patch.
> 
> Where is this file, it wasn't in the 1.5 commit?

Ah, it's in my screenshot branch.  Sorry: https://github.com/ctalbert/mozmill/blob/screenshot/jsbridge/jsbridge/extension/resource/modules/jsbridgefilelogger.js

I added it because I got tired of re-writing the throw-away code to do logging on windows.  I figured it might come in handy next time we have to debug through jsbridge.
master:
https://github.com/mozautomation/mozmill/commit/cab4e710cd088b598959516f7399372f773ae29b
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: