Closed Bug 1046078 Opened 10 years ago Closed 10 years ago

JSbridge disconnect in testPreferences_masterPassword

Categories

(Mozilla QA Graveyard :: Mozmill Tests, defect, P1)

All
Windows 8
defect

Tracking

(firefox31 fixed, firefox32 fixed, firefox33 fixed, firefox34 fixed, firefox-esr24 wontfix, firefox-esr31 fixed)

RESOLVED FIXED
Tracking Status
firefox31 --- fixed
firefox32 --- fixed
firefox33 --- fixed
firefox34 --- fixed
firefox-esr24 --- wontfix
firefox-esr31 --- fixed

People

(Reporter: andrei, Assigned: cosmin-malutan)

References

Details

(Keywords: regression, reproducible, Whiteboard: [mozmill-test-failure])

Attachments

(8 files, 4 obsolete files)

Sample report: http://mozmill-daily.blargon7.com/#/functional/report/533154d2e18fa05bbdf2eaada8736bbb

Aurora 33 is affected. I can reproduce it on CI (windows) machines.
This does not reproduce locally.

Not sure if it's a regression in Firefox.
We did Windows OS updates a couple of days ago.
Checking both hypothesis atm.

Not all runs seem to be affected, see results from the last 3 days:
http://mozmill-daily.blargon7.com/#/functional/reports?app=All&branch=All&platform=Win&from=2014-07-27&to=2014-07-30

Affected are the ones with 32 tests.

While only running the masterPassword test I have a 50% reproduce rate.
When it fails I have the following message in the log:
> ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost

Which might imply that we don't properly restart Firefox.
Here's a more relevant slice of reports:
http://mozmill-daily.blargon7.com/#/functional/reports?app=All&branch=All&platform=All&from=2014-07-22

This failures seem to have started on 2014-07-22, which indicate that it has nothing to do with the Windows OS udpates, since we did these on the 28th and 29th.

We did however had the merge on the 22nd, so this looks like a regression.
Assignee: nobody → andrei.eftimie
Status: NEW → ASSIGNED
Only certain locales seem to be affected.
The dashboard shows failures for: de, ru and zh-CN
I can reproduce the failure with de. It doesn't reproduce with en-US
I am able to reproduce this locally.
Hopefully this will speed things up.

Unfortunately the oldest localised Aurora builds I can find on the FTP server are from the 24th, so this doesn't help much. I'll try finding exactly what (or why) is failing, maybe this will narrow where to look for a regressor.
Attached file testcase.js (obsolete) —
I get about 30-40% failure rate with this testcase (windows, de aurora build).
I'll look more into it tomorrow. I have no idea what the problem here. I suspect some race condition.
Attached patch skip.patchSplinter Review
This is failing often enough to warrant its disable.
I can complete a testrun with this test skipped on the affected locales on Aurora.
Attachment #8465396 - Flags: review?(andreea.matei)
Comment on attachment 8465396 [details] [diff] [review]
skip.patch

Review of attachment 8465396 [details] [diff] [review]:
-----------------------------------------------------------------

http://hg.mozilla.org/qa/mozmill-tests/rev/40192e011707 (aurora)
Attachment #8465396 - Flags: review?(andreea.matei) → review+
Whiteboard: [mozmill-test-failure] → [mozmill-test-failure][mozmill-test-skipped]
Attached file fail.txt
Running this with --debug I managed to get a pass and failure on the testcase.

This is the error:
> *************************
> A coding exception was thrown in a Promise resolution callback.
> See https://developer.mozilla.org/Mozilla/JavaScript_code_modules/Promise.jsm/Promise
> 
> Full message: TypeError: aWindow is null
> Full stack: pbu_privacyContextFromWindow@resource://gre/modules/PrivateBrowsingUtils.jsm:24:5
> pbu_isWindowPrivate@resource://gre/modules/PrivateBrowsingUtils.jsm:20:12
> LoginManagerContent.loginsFound@resource://gre/modules/LoginManagerContent.jsm:251:47
> Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:866:23
> this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:745:7
> Assert_waitFor@resource://mozmill/modules/assertions.js:595:7
> Httpd.prototype.stop@resource://mozmill/modules/frame.js:650:1
> AppQuitObserver.prototype.observe@resource://mozmill/modules/frame.js:439:11
> shutdownApplication@resource://mozmill/modules/frame.js:65:3
> events.startShutdown@resource://mozmill/modules/frame.js:169:5
> sendMessage@resource://mozmill/driver/msgbroker.js:44:5
> MozMillController.prototype.restartApplication@resource://mozmill/driver/controller.js:444:1
> test1@resource://mozmill/modules/frame.js -> file:///c:/work/bugs/1046078/testcase.js:12:3
> Runner.prototype.execFunction@resource://mozmill/modules/frame.js:752:5
> Runner.prototype.runTestModule@resource://mozmill/modules/frame.js:706:9
> Runner.prototype.runTestFile@resource://mozmill/modules/frame.js:690:3
> runTestFile@resource://mozmill/modules/frame.js:775:3
> Bridge.prototype._execFunction@resource://jsbridge/modules/Bridge.jsm:140:10
> Bridge.prototype.execFunction@resource://jsbridge/modules/Bridge.jsm:147:16
> @resource://jsbridge/modules/Server.jsm:32:1
> Server.Session/<@resource://jsbridge/modules/Server.jsm:32:5
> Sockets.Client.prototype.onMessage/event.notify@resource://jsbridge/modules/Sockets.jsm:40:11
> 
> *************************

Does look like a race condition. The window is closed before we are done with it?

I'm not sure why the top of the stack lies with PrivateBrowsingUtils.jsm:24:5 since we don't use PB in the testcase...
Attached file pass.txt
For comparison the passing test.
Uhg, I've just had this failing _without_ the exception mentioned in comment 9. That might be a different issue.
Attached file testcase.js (obsolete) —
Updated testcase.

We don't need to wait for anything else, just:
1. Open page
2. Wait for it to load
3. Open "about:newtab"
4. Call restartApplication()
Attachment #8464770 - Attachment is obsolete: true
Nightly is also affected.
I can reproduce it on the oldest l10n (de) Nightly build that is still available on the ftp server (2014-07-29).

en-US does not fail at all.
The message mentioned in comment 0 :
> ###!!! [Child][MessageChannel::SendAndWait] Error: Channel closing: too late to send/recv, messages will be lost
Also appears when this is not failing
Attached image about_newtab_en.jpg (obsolete) —
Here's about:newtab in en_US
Attached image about_newtab_de.jpg
about:newtab in de

Notice that the tiles are from the browsing history instead of the predefined tiles.

Theory: we close the browser while it's doing some operation in loading the page in these tiles. I'm not _sure_ if the "thumb" is an image or it's loading the actual webpage (I would assume it's a cached image).

I don't know why this behaviour is different on en-US vs other locales. I'll try finding some info on this.
Attached image about_newtab_en.jpg
Just a small update since the design of the "thumbs" was recently changed and in the previous screenshot I had an older Nightly build which had them looking differently.
Attachment #8467021 - Attachment is obsolete: true
This is the tracking bug for the changes to the tiles: bug 973273
Relevant changes are probably one of the issues tracked in bug 977787 (doing a lot of reading against them)

The "sponsored tiles" are only active for en-US. They are preffed as: `browser.newtabpage.directory.source: chrome://global/content/directoryLinks.json`

Where the directoryLinks.json contains entries only for en-US, that's where the difference between locales comes from.
So what happens when you disable the sponsored tiles for en-US? Will those builds also show this failure?
(In reply to Henrik Skupin (:whimboo) from comment #16)
> So what happens when you disable the sponsored tiles for en-US? Will those
> builds also show this failure?

That's what I have been hoping, this way I would be able to get the regressor.
Unfortunately I haven't yet been able to have a en-US build fail.
The directoryLinks.json file is stored in the profile, I'm now checking if there's anything else different between a `de` and a `en-US` profile.
I've rerun the testcase with mozmill/master, unfortunately the exit code is None, so we learnt nothing new here.
> ←[1;31mTEST-UNEXPECTED-FAIL←[0m | Connection to application lost (exit code: None)
I've actually also gotten the testcase to fail on Win with the latest Nightly en-US build:
> (restart) C:\work\mozmill>mozmill -t C:\work\bugs\1046078\testcase\testcase.js -
> b C:\Mozilla\Nightly\en-US\firefox-34.0a1.en-US.win32\firefox\firefox.exe
> ^/(.*)$
> ^/(.*)\.asis$
> ^/(.*)\.py$
> No handlers could be found for logger "wptserve"
> ^/(.*)$
> ←[1;34mTEST-START←[0m | C:\work\bugs\1046078\testcase\testcase.js | test1
> ←[1;32mTEST-PASS←[0m | C:\work\bugs\1046078\testcase\testcase.js | test1
> TEST-END | C:\work\bugs\1046078\testcase\testcase.js | finished in 1161ms
> 
> ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, me
> ssages will be lost
> 
> 2014-08-21 05:33:52: stackwalker.cc:125: INFO: Couldn't load symbols for: ntdll.
> pdb|86C5ABD953F644798B23D22818BE23A42
> 2014-08-21 05:33:52: stackwalker.cc:125: INFO: Couldn't load symbols for: kernel
> base.pdb|DD4522B344B04180A82AA5F4E0378A1A2
> 2014-08-21 05:33:52: stackwalker.cc:125: INFO: Couldn't load symbols for: nss3.p
> db|F04E131922E3412496921A8D8FBCEF451
> 2014-08-21 05:33:52: stackwalker.cc:125: INFO: Couldn't load symbols for: xul.pd
> b|713CFC6055EF4AEB8CF6347E4D5E0C0A2
> 2014-08-21 05:33:52: basic_code_modules.cc:88: INFO: No module at 0xd5f4a0
> 2014-08-21 05:33:52: basic_code_modules.cc:88: INFO: No module at 0xd5f5c8
> 2014-08-21 05:33:52: basic_code_modules.cc:88: INFO: No module at 0xd5f4a0
> 2014-08-21 05:33:52: basic_code_modules.cc:88: INFO: No module at 0x12314a0
> ←[1;31mTEST-UNEXPECTED-FAIL←[0m | Connection to application lost (exit code: Non
> e)
> ←[1;31mTEST-UNEXPECTED-FAIL←[0m | Disconnect Error: Application unexpectedly clo
> sed
> RESULTS | Passed: 1
> RESULTS | Failed: 1
> RESULTS | Skipped: 0
> Traceback (most recent call last):
>   File "c:\work\mozmill\mozmill\mozmill\__init__.py", line 899, in run
>     mozmill.run(tests, self.options.restart)
>   File "c:\work\mozmill\mozmill\mozmill\__init__.py", line 444, in run
>     self.handle_disconnect(e)
>   File "c:\work\mozmill\mozmill\mozmill\__init__.py", line 516, in handle_discon
> nect
>     self.stop_runner()
>   File "c:\work\mozmill\mozmill\mozmill\__init__.py", line 585, in stop_runner
>     raise errors.ShutdownError('client process shutdown unsuccessful')
> ShutdownError: client process shutdown unsuccessful

Not sure what's with the symbols since I used an opt build.
So what I've seen with this testcase on en-US is probably another issue, so I've filed bug 1057246.
(In reply to Andrei Eftimie from comment #18)
> I've rerun the testcase with mozmill/master, unfortunately the exit code is
> None, so we learnt nothing new here.

We know that we lost the connection to Firefox somehow, and that the application is still running! This might be also a longer delay during shutdown. So I wouldn't call it 'nothing new'.

So why no regression window test has been made here? It sounds like the problem is always reproducible.
(In reply to Henrik Skupin (:whimboo) from comment #21)
> So why no regression window test has been made here? It sounds like the
> problem is always reproducible.

The 'always reproducible' part was actually another bug, see comment 20: bug 1057246
This issue is only reproducible with localised builds and we lack older localised build on FPT, so we would require building Firefox locally with a locale (its possible that we don't need the full locale, only the directory tiles changes).

Unfortunately I can't use the testcase ATM until bug 1057246 gets fixed (or we can work around it somehow) because we're always run into that.
You could use an en-US build and installing the appropriate l10n XPI. Then you will have the version localized. I think that should perfectly be enough for testing.
(In reply to Henrik Skupin (:whimboo) from comment #23)
> You could use an en-US build and installing the appropriate l10n XPI. Then
> you will have the version localized. I think that should perfectly be enough
> for testing.

I didn't knew this was possible to localise FF only with an xpi. I'll look into it.
You can find specific language packs beside the aurora builds. They will work on all platforms even listed under win32: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014/08/2014-08-25-00-40-04-mozilla-aurora-l10n/win32/xpi/
Attached file testcase.js
There is no need for a localizated build here, similar to bug 1046645 this reproduces if we restart the browser from an about:newtab window while the tiles are loading.
For that we would only have to set the preference "browser.newtabpage.directory.source" to "data:application/json,{}".
I did reproduced this manually  with the 'Restartles Restart Add-on' but I couldn't caught it in a screencast.
Assignee: andrei.eftimie → cosmin.malutan
Attachment #8466966 - Attachment is obsolete: true
Also be aware that this reproduces only on windows.
OS: All → Windows 8
So what exactly happens here? Especially when you do it manually? I miss that important information. Also you most likely want to create a Mozmill bug for that.

For the tests we could workaround it. So any reason why we are using about:newtab and not about:blank in teardown? We only want to get rid of the loaded page, so the latter page should be perfectly fine.

http://hg.mozilla.org/qa/mozmill-tests/file/488cc02b7a85/firefox/tests/functional/restartTests/testPreferences_masterPassword/test1.js#l28
Right, so in tests, when we open a certain location, then open "about:newtab" and call the restart in test while the tiles are loading, the firefox won't restart after shutting down and we will get:
>TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
> ....
> ....
>Exception: Cannot connect to jsbridge extension, port 56612

When I've done this manually, I prepared a profile with the 'Restartless Restart Add-on' installed, I opened the "about:newtab" page, I hit the restart button, firefox closed but it didn't come back.

We are using the about:newtab and not about:blank to be in sync with the closeAllTabs method, and we assume that a user will start a browsing session from about:newtab, but about:blank should be perfectly fine for a workaround. 

Regarding the mozmill bug I should file, I guess we have bug 974971 for tracking the root issue and we could provide the workaround in this bug.
Attached patch patch v1.0 (obsolete) — Splinter Review
Here is the patch for nightly, once backported on aurora we should backout the skip patch.
Attachment #8481130 - Flags: review?(hskupin)
Attachment #8481130 - Flags: review?(andrei.eftimie)
Depends on: 974971
(In reply to Cosmin Malutan from comment #29)

> Right, so in tests, when we open a certain location, then open
> "about:newtab" and call the restart in test while the tiles are loading, the
> firefox won't restart after shutting down and we will get:
> >TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
> > ....
> > ....
> >Exception: Cannot connect to jsbridge extension, port 56612

Can you please try to run exactly this testcase with Mozmill 2.1 (latest master)? It should tell that Firefox hasn't been restarted even requested so. It would be a good proof.

> When I've done this manually, I prepared a profile with the 'Restartless
> Restart Add-on' installed, I opened the "about:newtab" page, I hit the
> restart button, firefox closed but it didn't come back.

Wonderful!! That's a perfect testcase for bug 974971, even I'm not sure if that is a different incarnation of the underlying issue.

> We are using the about:newtab and not about:blank to be in sync with the
> closeAllTabs method, and we assume that a user will start a browsing session
> from about:newtab, but about:blank should be perfectly fine for a
> workaround. 

If it turns out that about:newtab is really causing all this mess with restarts, we should indeed change it to about:blank too. But lets figure that out on the other bug first. Good catch Cosmin!

> Regarding the mozmill bug I should file, I guess we have bug 974971 for
> tracking the root issue and we could provide the workaround in this bug.

There is NO mozmill bug in this case. It's simply Firefox, which is playing bad with us. Mozmill should only report this correctly.
Comment on attachment 8481130 [details] [diff] [review]
patch v1.0

Review of attachment 8481130 [details] [diff] [review]:
-----------------------------------------------------------------

Please combine this patch with the unskip. Otherwise please see my inline comment.

::: firefox/tests/functional/restartTests/testPreferences_masterPassword/test1.js
@@ +34,5 @@
>  var teardownModule = function(aModule) {
> +  // Bug 1046078
> +  // We have to leave the test in about:blank instead of about:newtab because if
> +  // tiles are loading while we are shutting down, firefox won't restart
> +  controller.open("about:blank");

I would not call this a workaround. At this stage it is not necessary to use about:newtab at all. We just want to blank out the currently loaded web page. So just change the single line.
Attachment #8481130 - Flags: review?(hskupin)
Attachment #8481130 - Flags: review?(andrei.eftimie)
Attachment #8481130 - Flags: review-
Attached patch patch v1.1Splinter Review
Thanks Henrik, as discussed on IRC, I didn't add the un-skip patch, because on nightly the test is not skipped but we need the fix there too, so we will backout the skip-patch on aurora.
Attachment #8481130 - Attachment is obsolete: true
Attachment #8481148 - Flags: review?(hskupin)
Comment on attachment 8481148 [details] [diff] [review]
patch v1.1

Review of attachment 8481148 [details] [diff] [review]:
-----------------------------------------------------------------

Who-ever checks this in, please make sure to also unskip the test. Thanks Cosmin!
Attachment #8481148 - Flags: review?(hskupin) → review+
Lets do this change for all the branches regarding our stability work.
No longer depends on: 974971
Whiteboard: [mozmill-test-failure][mozmill-test-skipped] → [mozmill-test-failure]
All branches are affected, only on aurora it was skippe.
(In reply to Henrik Skupin (:whimboo) from comment #31)
> Can you please try to run exactly this testcase with Mozmill 2.1 (latest
> master)? It should tell that Firefox hasn't been restarted even requested
> so. It would be a good proof.
Balow is the log with mozmill 2.1
>
No handlers could be found for logger "wptserve"
^/(.*)$
TEST-START | c:\Users\svuser\Desktop\testCase.js | test1
TEST-PASS | c:\Users\svuser\Desktop\testCase.js | test1
TEST-END | c:\Users\svuser\Desktop\testCase.js | finished in 610ms

###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, me       ssages will be lost

2014-08-29 11:54:40: stackwalker.cc:125: INFO: Couldn't load symbols for: ntdll.       pdb|CD4062A231154A17A18DAE7D1A0FBACC2
2014-08-29 11:54:40: stackwalker.cc:125: INFO: Couldn't load symbols for: kernel       32.pdb|17D8B8C647E14C1DAF9C6FC206AAAC512
2014-08-29 11:54:40: stackwalker.cc:125: INFO: Couldn't load symbols for: nss3.p       db|F82CA356137646B49C03516C04FD1E6F1
2014-08-29 11:54:40: stackwalker.cc:125: INFO: Couldn't load symbols for: xul.pd       b|FFA7CC61623349F98F1C2D417E6423442
2014-08-29 11:54:40: basic_code_modules.cc:88: INFO: No module at 0x37f7f4
2014-08-29 11:54:40: basic_code_modules.cc:88: INFO: No module at 0x37f91c
2014-08-29 11:54:40: basic_code_modules.cc:88: INFO: No module at 0x37f7f4
2014-08-29 11:54:40: basic_code_modules.cc:88: INFO: No module at 0x4ffb18
TEST-UNEXPECTED-FAIL | Connection to application lost (exit code: None)
TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
RESULTS | Passed: 1
RESULTS | Failed: 1
RESULTS | Skipped: 0
Traceback (most recent call last):
  File "c:\Users\svuser\Desktop\mozmill-env\python\lib\site-packages\mozmill-2.1       _dev-py2.7.egg\mozmill\__init__.py", line 899, in run
    mozmill.run(tests, self.options.restart)
  File "c:\Users\svuser\Desktop\mozmill-env\python\lib\site-packages\mozmill-2.1       _dev-py2.7.egg\mozmill\__init__.py", line 444, in run
    self.handle_disconnect(e)
  File "c:\Users\svuser\Desktop\mozmill-env\python\lib\site-packages\mozmill-2.1       _dev-py2.7.egg\mozmill\__init__.py", line 516, in handle_disconnect
    self.stop_runner()
  File "c:\Users\svuser\Desktop\mozmill-env\python\lib\site-packages\mozmill-2.1       _dev-py2.7.egg\mozmill\__init__.py", line 585, in stop_runner
    raise errors.ShutdownError('client process shutdown unsuccessful')
ShutdownError: client process shutdown unsuccessful
Please do not forget to remove the checkin-needed keyword once the landing has been completed. Nothing exists for landing right now.
Keywords: checkin-needed
Here is the patch for esr31.
Attachment #8483343 - Flags: review?(andreea.matei)
Comment on attachment 8483343 [details] [diff] [review]
patch v1.1[esr31]

Review of attachment 8483343 [details] [diff] [review]:
-----------------------------------------------------------------

http://hg.mozilla.org/qa/mozmill-tests/rev/2aa30fcdfe6f (esr31)
Attachment #8483343 - Flags: review?(andreea.matei) → review+
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: