Last Comment Bug 922995 - Some restart tests are not being run in sequence and cause various failures (like persisted.addon not defined)
: Some restart tests are not being run in sequence and cause various failures (...
Status: RESOLVED FIXED
[mozmill-2.0.1+]
:
Product: Testing Graveyard
Classification: Graveyard
Component: Mozmill (show other bugs)
: unspecified
: All All
: P1 critical
: ---
Assigned To: Cosmin Malutan, [:cosmin-malutan]
: Henrik Skupin (:whimboo)
Mentors:
http://mozmill-staging.blargon7.com/#...
: 923082 923091 (view as bug list)
Depends on: 874502 966235
Blocks: 933132 931828
  Show dependency treegraph
 
Reported: 2013-10-02 01:19 PDT by Andreea Matei [:AndreeaMatei]
Modified: 2016-08-24 09:02 PDT (History)
5 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
testcase 1 (2.60 KB, patch)
2013-10-07 05:56 PDT, Andrei Eftimie
no flags Details | Diff | Splinter Review
testcase.js (866 bytes, text/javascript)
2013-10-08 03:44 PDT, Andrei Eftimie
no flags Details
debug-fail.txt (30.04 KB, text/plain)
2013-10-09 00:59 PDT, Andrei Eftimie
no flags Details
0001-Bug-922995-Added-a-sleep-in-restartApplication-to-le.patch (1.33 KB, patch)
2013-10-14 04:45 PDT, Andrei Eftimie
no flags Details | Diff | Splinter Review
patch_v2 (1.99 KB, patch)
2013-10-14 06:29 PDT, Andrei Eftimie
no flags Details | Diff | Splinter Review
patch v3.0 (1.92 KB, patch)
2013-10-24 06:01 PDT, Cosmin Malutan, [:cosmin-malutan]
hskupin: review-
Details | Diff | Splinter Review
patch_v3.0 (924 bytes, patch)
2013-10-24 08:12 PDT, Cosmin Malutan, [:cosmin-malutan]
hskupin: review+
Details | Diff | Splinter Review
mutt_test (5.54 KB, patch)
2013-10-25 15:36 PDT, Cosmin Malutan, [:cosmin-malutan]
hskupin: feedback+
Details | Diff | Splinter Review
mutt_test (7.75 KB, patch)
2013-10-30 04:05 PDT, Cosmin Malutan, [:cosmin-malutan]
no flags Details | Diff | Splinter Review

Description Andreea Matei [:AndreeaMatei] 2013-10-02 01:19:51 PDT
We caught this with Mozmill 2.0 on testruns, with Aurora on Windows 8:
http://mozmill-crowd.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750be3d7a

And once on Ubuntu 13.04:
http://mozmill-staging.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750403c03

There were different tests failing with the same error, Andrei has investigated this a bit yesterday.
Comment 1 Andrei Eftimie 2013-10-02 01:28:51 PDT
The problem is that test1 has not been run at all in both cases!
The test is does not appear in any reporting, it is missing in its entirety.
Comment 2 Henrik Skupin (:whimboo) 2013-10-02 06:06:29 PDT
Which branches and platforms are affected here and is it reproducible? Andreea, your input would be kinda appreciated.
Comment 3 mario garbi 2013-10-02 06:56:02 PDT
*** Bug 923091 has been marked as a duplicate of this bug. ***
Comment 4 Andreea Matei [:AndreeaMatei] 2013-10-02 07:23:28 PDT
I reproduced so far only in testruns, with Aurora on Linux and Windows 8. I'm checking OS X now as well.
With running just the manifest (the one for restart tests) it's not failing for me.
Comment 5 Henrik Skupin (:whimboo) 2013-10-02 08:08:25 PDT
This issue has been started to appear with the mozmill-environment of Mozmill 2.0 which I have activated on Sep 25th on staging. There are no other reports before that date available in the staging dashboard. Between 2.0rc6 and 2.0 final no package dependencies have been changed for mozmill. This might be something related to our mozmill-environment.
Comment 6 Henrik Skupin (:whimboo) 2013-10-02 08:10:17 PDT
*** Bug 923082 has been marked as a duplicate of this bug. ***
Comment 7 Henrik Skupin (:whimboo) 2013-10-02 08:55:34 PDT
This failed a lot while Mozmill 2.0 was active in production CI. It's a blocker for us to get 2.0 enabled again. So please consider it as a P1.
Comment 8 mario garbi 2013-10-02 23:28:53 PDT
Setting it as P1.
Comment 9 Andrei Eftimie 2013-10-03 02:34:07 PDT
I can reproduce this on both Aurora and Nightly.

This looks like something regressed in Firefox as I can always reproduce this failure with latest builds, but not with older ones.

Working towards a short regression range.
Comment 10 Henrik Skupin (:whimboo) 2013-10-03 04:10:51 PDT
Interesting. It could be something related to NSPR so that we fail sometimes to create a jsbridge connection and silently drop tests. Whatever it is, I think there will be a bug in Mozmill too. But lets figure out first what has been caused that.

Andrei, on which platform can you always reproduce that? I wasn't able yesterday.
Comment 11 Andrei Eftimie 2013-10-03 04:37:47 PDT
Windows 8, I'm running a reduced testrun (only restart tests).
I'm seeing a failure rate of about 1 in 3-4.

So far:
> 2013-09-17 good // I've run this build about 10 times with no failure
> 2013-09-27 bad http://mozmill-crowd.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750d5eea2

I'm also inserting some dumps into mozmill now to also try and pinpoint the failure place at the same time as reducing the regression range
Comment 12 Andrei Eftimie 2013-10-03 06:47:35 PDT
Going slowly since I've gotten more false positives (few runs with no failures, just to mark the build as bad on a later run).

So far:
2013-09-18 good
2013-09-21 bad http://mozmill-crowd.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750e297bc

A pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ab4ccf3d6b60&tochange=a2c31dc69ab3
(still to big to be really useful)

We do have an update to NSPR in the 20th, see bug 917391
Comment 13 Henrik Skupin (:whimboo) 2013-10-03 20:12:56 PDT
Hm, the patch for the mentioned bug also got landed for beta. Do you see that failure there too? If yes, did Aurora and Beta regress between Sep 23 and 24? Have you gotten any more insight on that especially the mozmill logs? I don't see that much added to the bug for a day of work.
Comment 14 Andrei Eftimie 2013-10-04 06:41:53 PDT
This looks indeed like a Socket communication problem.

Here we pass the testFile to be run to frame.js:
https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L355

But in frame.js we never execute anything:
https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/extension/resource/modules/frame.js#L773

When this fails:
> single test: c:\users\user\appdata\local\temp\tmpwjkxbn.mozmill-tests\tests\functional\restartTests\testAddons_enableDisableExtension\test2.js
> 
> TEST-START | restartTests\testAddons_enableDisableExtension\test2.js | setupModule
> TEST-START | restartTests\testAddons_enableDisableExtension\test2.js | testDisableExtension
> TEST-PASS | restartTests\testAddons_enableDisableExtension\test2.js | testDisableExtension
> TEST-START | restartTests\testAddons_enableDisableExtension\test2.js | teardownModule
> TEST-END | restartTests\testAddons_enableDisableExtension\test2.js | finished in 1131ms
> 
> single test: c:\users\user\appdata\local\temp\tmpwjkxbn.mozmill-tests\tests\functional\restartTests\testAddons_enableDisableExtension\test3.js
> 
> 
> single test: c:\users\user\appdata\local\temp\tmpwjkxbn.mozmill-tests\tests\functional\restartTests\testAddons_enableDisableExtension\test4.js

The dump above is inserted right before this line:
https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L355

I have not been able to have any output dumped in frame js when this fails.

I am also puzzled by how the runner gets invoked, even in a passing run (with an older build),
we still see the frame.Runner being killed after test1 and not reinitialized for the other tests.

> TEST-START | restartTests\testPreferences_masterPassword\test1.js | setupModule
> TEST-START | restartTests\testPreferences_masterPassword\test1.js | testSetMasterPassword
> TEST-PASS | restartTests\testPreferences_masterPassword\test1.js | testSetMasterPassword
> TEST-START | restartTests\testPreferences_masterPassword\test1.js | teardownModule
> TEST-END | restartTests\testPreferences_masterPassword\test1.js | finished in 1626ms
> 
>  !this.ended: true
> 
>  runner END
> 
>  !this.ended: false
> 
> single test: c:\users\user\appdata\local\temp\tmpxfd78i.mozmill-tests\tests\functional\restartTests\testPreferences_masterPassword\test2.js
> 
> TEST-START | restartTests\testPreferences_masterPassword\test2.js | setupModule
> TEST-START | restartTests\testPreferences_masterPassword\test2.js | testInvokeMasterPassword
> TEST-PASS | restartTests\testPreferences_masterPassword\test2.js | testInvokeMasterPassword
> TEST-START | restartTests\testPreferences_masterPassword\test2.js | teardownModule
> TEST-END | restartTests\testPreferences_masterPassword\test2.js | finished in 655ms
> 
> single test: c:\users\user\appdata\local\temp\tmpxfd78i.mozmill-tests\tests\functional\restartTests\testPreferences_masterPassword\test3.js
> 
> TEST-START | restartTests\testPreferences_masterPassword\test3.js | setupModule
> TEST-START | restartTests\testPreferences_masterPassword\test3.js | testRemoveMasterPassword
> TEST-PASS | restartTests\testPreferences_masterPassword\test3.js | testRemoveMasterPassword
> TEST-START | restartTests\testPreferences_masterPassword\test3.js | teardownModule
> TEST-END | restartTests\testPreferences_masterPassword\test3.js | finished in 452ms

Is this by design? I am missing something here.
We still call the same function from frame.js: runTestFile(), yet it doesn't appear to be executing, but we are running the file itself...

===

But I've also ruled out bug 917391 by reproducing the failure with a tinderbox build right before that patch landed.
Reproduced with both http://hg.mozilla.org/integration/mozilla-inbound/rev/ae623dc4f044 and http://hg.mozilla.org/integration/mozilla-inbound/rev/d7118589114f

Still puzzled by when or why this happens.
Comment 15 Andrei Eftimie 2013-10-07 00:45:42 PDT
Running more test on mozilla-inbound led me to this pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e626543f0a79&tochange=8db44525abf6

If this is correct we should look at bug 874502.
At a first glance we might not wait enough (or properly) after we close a window.
Comment 16 Andrei Eftimie 2013-10-07 00:57:21 PDT
Ehh, most of the mentioned changes are for fixing tests.
I'm not sure, but this appears to be the real change: http://hg.mozilla.org/integration/mozilla-inbound/rev/8db44525abf6
Comment 17 Andrei Eftimie 2013-10-07 03:14:04 PDT
I've been able to reproduce this directly with running tests with mozmill.
> mozmill -m %path%/functional/restartTests/manifest.ini -b %latestFirefox%

Lets see if I can fail it often enough to build a testcase around it.
Comment 18 Andrei Eftimie 2013-10-07 05:56:52 PDT
Created attachment 814073 [details] [diff] [review]
testcase 1

Attached is a testcase.
It is a patch since it still relies on some mozmill-tests libs.

I will reduce it further, but I'm uploading it now to have an update on my progress.

This will _not_ fail every time. I see it failing about 10%.
On Windows 8.

When it does fail, I also see the following message in the console:
> IO Completion Port unexpectedly closed
Comment 19 Andrei Eftimie 2013-10-08 03:44:28 PDT
Created attachment 814339 [details]
testcase.js

Minimized testcase attached.

This one will also throw an error:
> Parent process 5032 exited with children alive:
> PIDS: 3644
> Attempting to kill them...
> Parent process 5032 exited with children alive:
> PIDS: 3644
> Attempting to kill them...
> Error Code 6 trying to query IO Completion Port, exiting
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File "C:\mozmill-env\2.0-windows\mozmill-env\python\Lib\threading.py", line 551, in __bootstrap_inner
>     self.run()
>   File "C:\mozmill-env\2.0-windows\mozmill-env\python\Lib\threading.py", line 504, in run
>     self.__target(*self.__args, **self.__kwargs)
>   File "c:\mozmill-env\2.0-windows\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 282, in _procmgr
>     self._poll_iocompletion_port()
>   File "c:\mozmill-env\2.0-windows\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 332, in _poll_iocompletion_port
>     raise WinError(errcode)
> WindowsError: [Error 6] The handle is invalid.

If we let test2 in a different file we don't see this error, we fail silently and the console only logs:
> IO Completion Port unexpectedly closed

For me this has a failure rate of about 10% on Windows 8
Comment 20 Andrei Eftimie 2013-10-08 04:34:16 PDT
I'm also confirming bug 874502 as a regressor.
100+ successful runs with the tinderbox build pre bug 874502 landing on inbound.
Comment 21 Henrik Skupin (:whimboo) 2013-10-08 08:49:30 PDT
Hm, it could be why I see so many js errors in the console with 1.5:

00:03:17.746 [Exception... "'[JavaScript Error: "aWindow.gBrowser is null" {file: "resource://mozmill/modules/init.js" line: 76}]' when calling method: [nsIDOMEventListener::handleEvent]"  nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: yes]

Mozmill 2.0 might handle it differently and fail.
Comment 22 Andrei Eftimie 2013-10-09 00:24:25 PDT
I've been able to see this on OSX as well (no error message here though):
> OpenGL version detected: 210
> OpenGL version detected: 210
> TEST-START | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-PASS | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-END | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | finished in 3090ms
> OpenGL version detected: 210
> OpenGL version detected: 210
> TEST-START | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test2
> TEST-PASS | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test2
> TEST-END | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | finished in 1ms
> RESULTS | Passed: 2
> RESULTS | Failed: 0
> RESULTS | Skipped: 0
> OpenGL version detected: 210
> OpenGL version detected: 210
> TEST-START | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-PASS | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-END | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | finished in 4755ms
> OpenGL version detected: 210
> OpenGL version detected: 210
> RESULTS | Passed: 1
> RESULTS | Failed: 0
> RESULTS | Skipped: 0

A 1% failure rate on OSX
Comment 23 Andrei Eftimie 2013-10-09 00:59:19 PDT
Created attachment 814772 [details]
debug-fail.txt

--debug does not help us, attaching the log just for reference.

The only difference between a successful and a failed run is the failure as mentioned in comment 19 (+timestamps and temporary paths, but these are expected)
Comment 24 Andrei Eftimie 2013-10-09 01:10:58 PDT
It looks like we are trying to close the socket while it still has some active connections.

Looking into a fix, probably wait for all socket connections to close before trying to close the it)
Comment 25 Andrei Eftimie 2013-10-09 05:08:18 PDT
Henrik, Dave if any of you have experience with how we treat processes in mozbase and how the socket communication works with NSS/NSPR, I would really welcome some tips.

And a more poignant question, see how the processes are handled in mozbase:
> $ mozmill -t testcase/testcase.js -b /c/Mozilla/Nightly/en-US/firefox/firefox.exe
> DBG::MOZPROC Self.pid value is: 4404
> DBG::MOZPROC new process detected with pid value: 5160
> DBG::MOZPROC process id 5160 exited normally
> TEST-START | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test1
> TEST-PASS | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test1
> TEST-END | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | finished in 3873ms
> DBG::MOZPROC new process detected with pid value: 4784
> DBG::MOZPROC process id 4404 exited normally
> TEST-START | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test2
> TEST-PASS | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test2
> TEST-END | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | finished in 2ms
> DBG::MOZPROC process id 4784 exited normally
> DBG::MOZPROC job object msg active processes zero
> RESULTS | Passed: 2
> RESULTS | Failed: 0
> RESULTS | Skipped: 0

This is for a passing example, when we fail we never close the 2nd child process.

Now, is it normal that we:
1) open the MAIN process
2) open child_1 process
3) close child_1 process
4) open child_2 process
5) close the MAIN process
6) close child_2 process

I don't have a clear grasp on how this should work, but shouldn't all child processes be killed before we kill the main process?

When this fails, we never see 6) child_2 is not closed.
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L315
portstatus is always 0, and we timeout after 180 seconds here:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L325

Any help/ideas would be appreciated.
Thanks
Comment 26 Andrei Eftimie 2013-10-11 00:53:22 PDT
I've set up mm-win-8-64-3 to be able to test this failure.
It reproduces there at about the same rate I've seen locally 10-20%

All is setup in /c/922995/
I've also setup mozmill and mozbase, you will probably need to clear other versions of used packages in C:\922995\mozmill-env\python\Lib\site-packages

There is a terminal open that shows the latest failure message.
Comment 27 Andrei Eftimie 2013-10-14 04:45:41 PDT
Created attachment 816525 [details] [diff] [review]
0001-Bug-922995-Added-a-sleep-in-restartApplication-to-le.patch

This is a hack since I still don't know exactly what or when to correctly wait for.

But if we let enough time for the broker to send the shutdown message trough the socket we don't fail anymore.

I have had 100 testcase runs without failure with this 100ms sleep.
The time is arbitrary.

We still fail with a sleep of 0ms, so just breaking the thread does not help.

I'm not sure: should we have some sort of callback that the message has been received? Just a confirmation that the message has been received. But that will have to travel through the socket again...

I'm running full testruns now to see if this still fails. Maybe we should be waiting more than a 100ms. 

I don't think stopApplication() needs this sleep, the testcase doesn't handle that, but it should be visible once I run more full testruns.
Comment 28 Andrei Eftimie 2013-10-14 05:10:20 PDT
We also need the sleep for stopApplication()
We failed to run test1 here:
http://mozmill-crowd.blargon7.com/#/functional/report/ec449c026814fd64783d738c025fe34e
Comment 29 Andrei Eftimie 2013-10-14 06:29:14 PDT
Created attachment 816567 [details] [diff] [review]
patch_v2

Added a sleep also for stopApplication().

Running 100 testruns on windows, they might finish by tomorrow, reporting on mozmill-crowd. So far 3 testruns without a problem.
Comment 30 Andrei Eftimie 2013-10-14 22:59:10 PDT
Just to confirm: 101 successful functional testruns!
Comment 31 Andrei Eftimie 2013-10-23 03:21:06 PDT
Here is a crazy idea. We've talked about it in the office, and it might make sense.

What if the recent WSAEINVAL failures are related to this issue?
WSAEINVAL on windows is a Socket communication problem.
In mozmill 1.5 we don't forcefully close Firefox so we're not encountering the same Parent / Child process issues we've seen with 2.0

Here is (a very crude and not very accurate) plot of the failures by date:
> 2013-02-25 :: 1
> 2013-03-29 :: 1
> 2013-09-06 :: 1
> 2013-09-18 :: 1
> 2013-09-25 :: 1
> 2013-09-27 :: 2
> 2013-09-30 :: 1
> 2013-10-01 :: 2
> 2013-10-11 :: 2
> 2013-10-14 :: 2
> 2013-10-15 :: 1
> 2013-10-18 :: 2
> 2013-10-19 :: multiple failures (didn't retrigger over the weekend failures
> 2013-10-20 :: due to full queue after merge)
> 2013-10-21 :: 1
> 2013-10-22 :: 3
> 2013-10-23 :: 4

These are sampled from failed build emails containing "WSAEINVAL".
Notice how they pick up right after 18 Sept - that's when our regressor landed.

I'm not willing to bet money on this (well, maybe 5€).
But they might be related.

If this is indeed the case, we should push to have this fixed and use mozmill 2.0 in production. We've seen the WSAEINVAL failure 4-5 times per day lately.
Comment 32 Henrik Skupin (:whimboo) 2013-10-23 05:13:09 PDT
For which versions of Firefox are those numbers?
Comment 33 Andrei Eftimie 2013-10-23 05:30:40 PDT
> For which versions of Firefox are those numbers?

Valid point. Just checked and you are right. Versions are all over the place.
Discard comment 31. There can't be any correlation there.
Comment 34 Dave Hunt (:davehunt) 2013-10-23 06:18:32 PDT
Comment on attachment 816567 [details] [diff] [review]
patch_v2

Review of attachment 816567 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to defer to Henrik on this patch, as he's much more familiar with the Mozmill internals than me.
Comment 35 Cosmin Malutan, [:cosmin-malutan] 2013-10-24 06:01:18 PDT
Created attachment 821662 [details] [diff] [review]
patch v3.0

Hi, I had a look on this bug yesterday, and from what I saw that we are doing differently from 1.5, is that we are not closing the application with force-quit flag, and we don't check if we had granted the access to close the application by some others(addons/modals). So I added an observer to watch if we can close the application, and if we can't I set the flag to force-quit. I tested this on all machines to be sure I don't broke anything.
Also Andrei runned tests on the machine he reproduced this failure before from the morning and we have't seen any more this failure.
Comment 37 Henrik Skupin (:whimboo) 2013-10-24 07:08:52 PDT
Comment on attachment 821662 [details] [diff] [review]
patch v3.0

Review of attachment 821662 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozmill/mozmill/extension/resource/modules/frame.js
@@ +80,5 @@
> +    });
> +  } catch (e) {
> +    // Close the application forcibly, if we had not been granted access to close it
> +    // by some other subprocesses so they won't hang
> +    flags |= Ci.nsIAppStartup.eForceQuit;

Oh!! That's pretty interesting. Where has this flag been gone? Probably I missed to add this in one of my larger frame.js patches. :/ Very good find!

So what we should do here is to always have the eForceQuit flag set. We can never trust any other module or add-on, which can interfere with us. So triggering a close should really close Firefox.

So please add this to the default flags at the beginning of this method. All the other stuff (observer and waitFor) should not be necessary, right?
Comment 38 Cosmin Malutan, [:cosmin-malutan] 2013-10-24 08:12:34 PDT
Created attachment 821710 [details] [diff] [review]
patch_v3.0

(In reply to Henrik Skupin (:whimboo) from comment #37)
> So please add this to the default flags at the beginning of this method. All
> the other stuff (observer and waitFor) should not be necessary, right?
Right.
Comment 39 Henrik Skupin (:whimboo) 2013-10-24 08:56:01 PDT
Given that, I first thought we regressed in Mozmill via my patch on bug 771517. But goQuitApplication is also using eAttemptQuit. So not sure what actually caused this.
Comment 40 Henrik Skupin (:whimboo) 2013-10-24 09:09:42 PDT
Comment on attachment 821710 [details] [diff] [review]
patch_v3.0

Review of attachment 821710 [details] [diff] [review]:
-----------------------------------------------------------------

That looks reasonable. Have you run all mutt tests across platforms beside our Mozmill tests? If that's the case you get my r+
Comment 41 Cosmin Malutan, [:cosmin-malutan] 2013-10-24 23:37:41 PDT
Yes, I runned mutt tests on all platforms.

(In reply to Henrik Skupin (:whimboo) from comment #39)
> Given that, I first thought we regressed in Mozmill via my patch on bug
> 771517. But goQuitApplication is also using eAttemptQuit. So not sure what
> actually caused this.
I searched for goQuitApplication on mxr, but after I made a dump on console I see that it's also uses eAttemptQuit, but it also checks if it can close the application (canQuitApplication()). If canQuitApplication returns false, we won't close the application, and we would try to close the socket connection (self.bridge) while browser is opened, I think this is causing the WSAEINVAL error, bug 915563. I would like to take the machine offline after it starts failing to print some dumps to be sure.
Comment 42 Henrik Skupin (:whimboo) 2013-10-25 03:13:00 PDT
Yeah, that might be. But I'm not totally convinced of that, given that the WSEINVAL issue is existent for a while now. 

Cosmin, what about the testcase attached to that bug? Is it one we could make use of by a Mutt Python test? I would think so. I would feel more confident if it would be included.
Comment 43 Andrei Eftimie 2013-10-25 03:32:05 PDT
+1 for including the testcase as a mutt test
Comment 44 Henrik Skupin (:whimboo) 2013-10-25 08:58:25 PDT
Cosmin mentioned to me on IRC that for the test it would need a modal dialog during shutdown. What about the clear history one getting opened and not closed? I think that should work. Also all this sounds somewhat related to another bug we had for mozmill but we already thought it was fixed. Strange.
Comment 45 Cosmin Malutan, [:cosmin-malutan] 2013-10-25 11:10:06 PDT
Hi Henrik, you're right that fails to, but I guess that clear history it works like print page(system modal), it prevents the browser from closing, I need to be able to close the browser and the modal to remain open (because it's unresponding). But trying to close the browser with the print/ clear history open also results in "Timeout: bridge.execFunction" / "Disconnect Error: Application unexpectedly closed"
Comment 46 Henrik Skupin (:whimboo) 2013-10-25 12:09:27 PDT
I think we could open the modal dialog asynchronously via setTimeout() and a delay of e.g. 50ms. At the same time we are closing the browser. It would be worth a try. If it doesn't help we might want to take the test as it is even if it's not 100% reproducible before the fix.
Comment 47 Cosmin Malutan, [:cosmin-malutan] 2013-10-25 15:36:57 PDT
Created attachment 822590 [details] [diff] [review]
mutt_test

(In reply to Henrik Skupin (:whimboo) from comment #46)
> I think we could open the modal dialog asynchronously via setTimeout() and a
> delay of e.g. 50ms. At the same time we are closing the browser. It would be
> worth a try. 
That was brilliant, if I open the modal asynchronously and then call restartApplication asynchronously the modal it hangs for as long the callback runs. Unfortunately I still don't get that failure, but I did't ran the tests on a windows machine yet.
>If it doesn't help we might want to take the test as it is even
> if it's not 100% reproducible before the fix.

So I made a mutt test in python, in which I run two mozmill tests, and I check that both tests passes.
In one test I open the AddonsManager, and in one clear history.
Comment 48 Henrik Skupin (:whimboo) 2013-10-26 00:35:45 PDT
Comment on attachment 822590 [details] [diff] [review]
mutt_test

Review of attachment 822590 [details] [diff] [review]:
-----------------------------------------------------------------

Please see my inline comments. Otherwise please always combine patches and don't upload separate ones for the core fix and tests. Thanks.

::: mutt/mutt/tests/python/js-modules/testCloseApplication/testQuitApplicationWithAMOpen.js
@@ +20,5 @@
> +    }, "Selected category has been loaded", 30000);
> +  }
> +  finally {
> +    controller.window.document.removeEventListener("ViewChanged", onViewLoaded, false);
> +  }

What's the condition of this test to show the problem? As you said a modal dialog should be necessary. Here we don't have anything like that. I'm not certain that this test buys us that much.

::: mutt/mutt/tests/python/js-modules/testCloseApplication/testQuitApplicationWithHistoryOpen.js
@@ +14,5 @@
> +    controller.sleep(1000);
> +  }, 100);
> +
> +  controller.window.setTimeout(function () {
> +    controller.restartApplication("testTwo");

In our original tests we haven't had all the tests in a single file. So I wonder if that makes a difference in terms that you weren't able to see this problem.

@@ +18,5 @@
> +    controller.restartApplication("testTwo");
> +  }, 500);
> +}
> +
> +var testTwo = function () {}

As you can remember we had the problem that we didn't successfully run the first couple of tests in a sequence. So you might want to replicate that here and keep the status in the persisted object. An additional check at the end should make it safer for the js side.
Comment 49 Cosmin Malutan, [:cosmin-malutan] 2013-10-26 06:48:11 PDT
Hi Henrik, before I will update the patch I want to be sure I approach it correctly.

(In reply to Henrik Skupin (:whimboo) from comment #48)
> testQuitApplicationWithAMOpen.js
> @@ +20,5 @@
> > +    }, "Selected category has been loaded", 30000);
> > +  }
> > +  finally {
> > +    controller.window.document.removeEventListener("ViewChanged", onViewLoaded, false);
> > +  }
> 
> What's the condition of this test to show the problem? As you said a modal
> dialog should be necessary. Here we don't have anything like that. I'm not
> certain that this test buys us that much.
This is the testcase where it failed, if the Add-ons Manager freezes or doesn't closes for some reason we get:
>> Parent process 5032 exited with children alive:
>> PIDS: 3644
>> Attempting to kill them...
>> Parent process 5032 exited with children alive:
>> PIDS: 3644
>> Attempting to kill them...
This failure I tried to reproduce in mutt test. And it seems to be fixed if in method frame/shutdownApplication(), we close the browser with eForceQuit flag. So I would like to keep this testcase too, what do you say ?
  
> :::
> mutt/mutt/tests/python/js-modules/testCloseApplication/
> testQuitApplicationWithHistoryOpen.js
> @@ +14,5 @@
> > +    controller.sleep(1000);
> > +  }, 100);
> > +
> > +  controller.window.setTimeout(function () {
> > +    controller.restartApplication("testTwo");
> 
> In our original tests we haven't had all the tests in a single file. So I
> wonder if that makes a difference in terms that you weren't able to see this
> problem.
I testcase Andrei made they are on the same file, but I run testes from two different files so I would say that I treat both cases.
Also in python code I check that I had 2 successful tests so if one test it's missed the python assertion should fail.  
> @@ +18,5 @@
> > +    controller.restartApplication("testTwo");
> > +  }, 500);
> > +}
> > +
> > +var testTwo = function () {}
> 
> As you can remember we had the problem that we didn't successfully run the
> first couple of tests in a sequence. So you might want to replicate that
> here and keep the status in the persisted object. An additional check at the
> end should make it safer for the js side.
That sounds great, I will keep tracking in persisted object and check that all tests had been ran, and that they've been ran in the right order.

Again, my question is if I can keep the first test too, since is the on in which we reproduced the failure, and also I will have tests from two different files too ?
Comment 50 Henrik Skupin (:whimboo) 2013-10-28 04:16:59 PDT
(In reply to Cosmin Malutan from comment #49)
> > What's the condition of this test to show the problem? As you said a modal
> > dialog should be necessary. Here we don't have anything like that. I'm not
> > certain that this test buys us that much.
>
> This is the testcase where it failed, if the Add-ons Manager freezes or
> doesn't closes for some reason we get:

I don't see how we could force this to happen. And if that happens it's clearly a bug in Firefox. The UI should never freeze. So if it is reproducible it's another issue to get fixed. I would not include this test.

> >> Parent process 5032 exited with children alive:
> >> PIDS: 3644
> >> Attempting to kill them...
> >> Parent process 5032 exited with children alive:
> >> PIDS: 3644
> >> Attempting to kill them...
> This failure I tried to reproduce in mutt test. And it seems to be fixed if
> in method frame/shutdownApplication(), we close the browser with eForceQuit
> flag. So I would like to keep this testcase too, what do you say ?

I would suggest to analyze the problem of the freeze separately and not patch around it. Mozmill should be able to close Firefox in such a case, that's true. But let us not add this case which is mostly intermittent.

> > In our original tests we haven't had all the tests in a single file. So I
> > wonder if that makes a difference in terms that you weren't able to see this
> > problem.
>
> I testcase Andrei made they are on the same file, but I run testes from two
> different files so I would say that I treat both cases.

As long as we support both cases it might be a good idea to also test them both. Thanks.

> > As you can remember we had the problem that we didn't successfully run the
> > first couple of tests in a sequence. So you might want to replicate that
> > here and keep the status in the persisted object. An additional check at the
> > end should make it safer for the js side.
>
> That sounds great, I will keep tracking in persisted object and check that
> all tests had been ran, and that they've been ran in the right order.

Sounds good.
Comment 51 Andrei Eftimie 2013-10-28 10:26:30 PDT
> I would suggest to analyze the problem of the freeze separately and not patch around it.

Just to add my 2 cents here.
In the testcase attached to this bug nothing really freezes.

We wait for 180 seconds to timeout after we try to close the application but fail to close the child process (which is the socket communication) because it was still running at the time of the quit.

The 3 minutes are set here:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L56
Comment 52 Henrik Skupin (:whimboo) 2013-10-29 01:25:47 PDT
180s for timing out what? This is still not clear to me. And when you are sure there is no freeze, what is it? The application doesn't seem to respond, right? Also who wrote the pointed out mozprocess code? We should bring this person into this discussion. Not sure why this hasn't happened earlier here, if it's really an issue with mozprocess.
Comment 53 Andrei Eftimie 2013-10-29 01:58:23 PDT
I am not sure I understand your question.
Here is the actual failure:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L330

See the comment above it:
> # Arbitrarily wait 3 minutes for windows to get its act together
> # Windows sometimes takes a small nap between notifying the
> # IO Completion port and actually killing the children, and we
> # don't want to mistake that situation for the situation of an unexpected
> # parent abort (which is what we're looking for here).

> Not sure why this hasn't happened earlier here, if it's really an issue with mozprocess.
I'm not sure the issue is with mozprocess, we fail to correctly shut down the app given that we still have a socket active.

The specific code has been inserted by Clint:
https://github.com/mozilla/mozbase/commit/e5165109f69a80b2720a6dc1bb54af1fa05f1a9c

Clint, do you still remember the above referenced code? What it should do and in what circumstances?
Comment 54 Henrik Skupin (:whimboo) 2013-10-29 02:20:22 PDT
(In reply to Andrei Eftimie from comment #53)
> I'm not sure the issue is with mozprocess, we fail to correctly shut down
> the app given that we still have a socket active.

And here is the problem which you haven't really answered yet. What is preventing us from closing Firefox with the add-on manager open? There is no modal dialog, and as you said no freeze. But what else is blocking us? If you can reproduce (which I cannot) please find that out.
Comment 55 Cosmin Malutan, [:cosmin-malutan] 2013-10-30 04:05:28 PDT
Created attachment 824575 [details] [diff] [review]
mutt_test

I still don't understand why opening the add-ons manager prevents us from closing the runner.

I updated the mutt test, and what it does now is:
1- Set's persisted object and open addons manager - next test is in different file 
2- Checks that test1 was ran before and  close addons manager - next test is in the same file 
3- Checks that test2 was ran before  and open addons manager - next test is in the same file 
4- Checks that test3 was ran before and that test4 should be ran now.

On python side it also checks that all four test have been ran and that the last one was test4.

The mutt test failed once in a 100 runs without the fix, and it did't fail in 500 with the fix applied.
Comment 56 Henrik Skupin (:whimboo) 2013-10-30 04:24:09 PDT
I really don't want to hold of this bug for being fixed anytime longer. If there is no easy way to get this reproduced, I would suggest we land the fix and work on a specific test in a follow-up bug. It already takes a bit too long here.

What has to be done to wrap this bug up?
Comment 57 Cosmin Malutan, [:cosmin-malutan] 2013-10-30 05:12:47 PDT
Agreed, considering that fix is landed on master, I think we need to transplant the fix on hotfix-2.0 and raise a new follow-up bug for test.

Fix landed 6 days ago on master:
https://github.com/mozilla/mozmill/commit/e30fc0bad248792379a05a7714cabd5ac69e4aac
Comment 58 Henrik Skupin (:whimboo) 2013-10-30 06:29:43 PDT
Oh, that push was not intended. Seems like I pushed this accidentally with the most recent one. My bad. But given that it has been landed, lets get it cherry-picked to hotfix-2.0, yes:

https://github.com/mozilla/mozmill/commit/e712dcd09fa03838c6ad8ee40f9234148bca42c9

Cosmin, please file the new bug for the test and mark as dependency here. Thanks.

Note You need to log in before you can comment on or make changes to this bug.