922995 - Some restart tests are not being run in sequence and cause various failures (like persisted.addon not defined)

Reporter

Description

•

11 years ago

We caught this with Mozmill 2.0 on testruns, with Aurora on Windows 8:
http://mozmill-crowd.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750be3d7a

And once on Ubuntu 13.04:
http://mozmill-staging.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750403c03

There were different tests failing with the same error, Andrei has investigated this a bit yesterday.

Andrei Eftimie

Comment 1

•

11 years ago

The problem is that test1 has not been run at all in both cases!
The test is does not appear in any reporting, it is missing in its entirety.

Priority: -- → P2

Summary: Test failure "persisted.addon is undefined" in restartTests → Tests are not being run and reported in testrun under mozmill 2.0

Andrei Eftimie

Updated

•

11 years ago

Component: Mozmill Tests → Mozmill

Product: Mozilla QA → Testing

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 2

•

11 years ago

Which branches and platforms are affected here and is it reproducible? Andreea, your input would be kinda appreciated.

Andreea Matei [:AndreeaMatei]

Reporter

Comment 4

•

11 years ago

I reproduced so far only in testruns, with Aurora on Linux and Windows 8. I'm checking OS X now as well.
With running just the manifest (the one for restart tests) it's not failing for me.

status-firefox26: --- → affected

OS: Linux → All

Hardware: x86_64 → All

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

11 years ago

status-firefox26: affected → ---

Summary: Tests are not being run and reported in testrun under mozmill 2.0 → Some restart tests are not being run in sequence and cause various failures (like persisted.addon not defined)

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

11 years ago

status-firefox26: --- → affected

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 5

•

11 years ago

This issue has been started to appear with the mozmill-environment of Mozmill 2.0 which I have activated on Sep 25th on staging. There are no other reports before that date available in the staging dashboard. Between 2.0rc6 and 2.0 final no package dependencies have been changed for mozmill. This might be something related to our mozmill-environment.

Component: Mozmill → Mozmill Tests

Product: Testing → Mozilla QA

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 7

•

11 years ago

This failed a lot while Mozmill 2.0 was active in production CI. It's a blocker for us to get 2.0 enabled again. So please consider it as a P1.

mario garbi

Comment 8

•

11 years ago

Setting it as P1.

Priority: P2 → P1

Andrei Eftimie

Comment 9

•

11 years ago

I can reproduce this on both Aurora and Nightly.

This looks like something regressed in Firefox as I can always reproduce this failure with latest builds, but not with older ones.

Working towards a short regression range.

Assignee: nobody → andrei.eftimie

Status: NEW → ASSIGNED

status-firefox27: --- → affected

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 10

•

11 years ago

Interesting. It could be something related to NSPR so that we fail sometimes to create a jsbridge connection and silently drop tests. Whatever it is, I think there will be a bug in Mozmill too. But lets figure out first what has been caused that.

Andrei, on which platform can you always reproduce that? I wasn't able yesterday.

Andrei Eftimie

Comment 11

•

11 years ago

Windows 8, I'm running a reduced testrun (only restart tests).
I'm seeing a failure rate of about 1 in 3-4.

So far:
> 2013-09-17 good // I've run this build about 10 times with no failure
> 2013-09-27 bad http://mozmill-crowd.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750d5eea2

I'm also inserting some dumps into mozmill now to also try and pinpoint the failure place at the same time as reducing the regression range

Andrei Eftimie

Comment 12

•

11 years ago

Going slowly since I've gotten more false positives (few runs with no failures, just to mark the build as bad on a later run).

So far:
2013-09-18 good
2013-09-21 bad http://mozmill-crowd.blargon7.com/#/functional/report/6ec6776efe900da3fd2b64a750e297bc

A pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ab4ccf3d6b60&tochange=a2c31dc69ab3
(still to big to be really useful)

We do have an update to NSPR in the 20th, see bug 917391

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 13

•

11 years ago

Hm, the patch for the mentioned bug also got landed for beta. Do you see that failure there too? If yes, did Aurora and Beta regress between Sep 23 and 24? Have you gotten any more insight on that especially the mozmill logs? I don't see that much added to the bug for a day of work.

Andrei Eftimie

Comment 14

•

11 years ago

This looks indeed like a Socket communication problem.

Here we pass the testFile to be run to frame.js:
https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L355

But in frame.js we never execute anything:
https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/extension/resource/modules/frame.js#L773

When this fails:
> single test: c:\users\user\appdata\local\temp\tmpwjkxbn.mozmill-tests\tests\functional\restartTests\testAddons_enableDisableExtension\test2.js
> 
> TEST-START | restartTests\testAddons_enableDisableExtension\test2.js | setupModule
> TEST-START | restartTests\testAddons_enableDisableExtension\test2.js | testDisableExtension
> TEST-PASS | restartTests\testAddons_enableDisableExtension\test2.js | testDisableExtension
> TEST-START | restartTests\testAddons_enableDisableExtension\test2.js | teardownModule
> TEST-END | restartTests\testAddons_enableDisableExtension\test2.js | finished in 1131ms
> 
> single test: c:\users\user\appdata\local\temp\tmpwjkxbn.mozmill-tests\tests\functional\restartTests\testAddons_enableDisableExtension\test3.js
> 
> 
> single test: c:\users\user\appdata\local\temp\tmpwjkxbn.mozmill-tests\tests\functional\restartTests\testAddons_enableDisableExtension\test4.js

The dump above is inserted right before this line:
https://github.com/mozilla/mozmill/blob/master/mozmill/mozmill/__init__.py#L355

I have not been able to have any output dumped in frame js when this fails.

I am also puzzled by how the runner gets invoked, even in a passing run (with an older build),
we still see the frame.Runner being killed after test1 and not reinitialized for the other tests.

> TEST-START | restartTests\testPreferences_masterPassword\test1.js | setupModule
> TEST-START | restartTests\testPreferences_masterPassword\test1.js | testSetMasterPassword
> TEST-PASS | restartTests\testPreferences_masterPassword\test1.js | testSetMasterPassword
> TEST-START | restartTests\testPreferences_masterPassword\test1.js | teardownModule
> TEST-END | restartTests\testPreferences_masterPassword\test1.js | finished in 1626ms
> 
>  !this.ended: true
> 
>  runner END
> 
>  !this.ended: false
> 
> single test: c:\users\user\appdata\local\temp\tmpxfd78i.mozmill-tests\tests\functional\restartTests\testPreferences_masterPassword\test2.js
> 
> TEST-START | restartTests\testPreferences_masterPassword\test2.js | setupModule
> TEST-START | restartTests\testPreferences_masterPassword\test2.js | testInvokeMasterPassword
> TEST-PASS | restartTests\testPreferences_masterPassword\test2.js | testInvokeMasterPassword
> TEST-START | restartTests\testPreferences_masterPassword\test2.js | teardownModule
> TEST-END | restartTests\testPreferences_masterPassword\test2.js | finished in 655ms
> 
> single test: c:\users\user\appdata\local\temp\tmpxfd78i.mozmill-tests\tests\functional\restartTests\testPreferences_masterPassword\test3.js
> 
> TEST-START | restartTests\testPreferences_masterPassword\test3.js | setupModule
> TEST-START | restartTests\testPreferences_masterPassword\test3.js | testRemoveMasterPassword
> TEST-PASS | restartTests\testPreferences_masterPassword\test3.js | testRemoveMasterPassword
> TEST-START | restartTests\testPreferences_masterPassword\test3.js | teardownModule
> TEST-END | restartTests\testPreferences_masterPassword\test3.js | finished in 452ms

Is this by design? I am missing something here.
We still call the same function from frame.js: runTestFile(), yet it doesn't appear to be executing, but we are running the file itself...

===

But I've also ruled out bug 917391 by reproducing the failure with a tinderbox build right before that patch landed.
Reproduced with both http://hg.mozilla.org/integration/mozilla-inbound/rev/ae623dc4f044 and http://hg.mozilla.org/integration/mozilla-inbound/rev/d7118589114f

Still puzzled by when or why this happens.

Andrei Eftimie

Comment 15

•

11 years ago

Running more test on mozilla-inbound led me to this pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e626543f0a79&tochange=8db44525abf6

If this is correct we should look at bug 874502.
At a first glance we might not wait enough (or properly) after we close a window.

Andrei Eftimie

Comment 16

•

11 years ago

Ehh, most of the mentioned changes are for fixing tests.
I'm not sure, but this appears to be the real change: http://hg.mozilla.org/integration/mozilla-inbound/rev/8db44525abf6

Andrei Eftimie

Comment 17

•

11 years ago

I've been able to reproduce this directly with running tests with mozmill.
> mozmill -m %path%/functional/restartTests/manifest.ini -b %latestFirefox%

Lets see if I can fail it often enough to build a testcase around it.

Andrei Eftimie

Comment 18

•

11 years ago

Attached patch testcase 1 (obsolete) — Details — Splinter Review

Attached is a testcase.
It is a patch since it still relies on some mozmill-tests libs.

I will reduce it further, but I'm uploading it now to have an update on my progress.

This will _not_ fail every time. I see it failing about 10%.
On Windows 8.

When it does fail, I also see the following message in the console:
> IO Completion Port unexpectedly closed

Andrei Eftimie

Comment 19

•

11 years ago

Attached file testcase.js — Details

Minimized testcase attached.

This one will also throw an error:
> Parent process 5032 exited with children alive:
> PIDS: 3644
> Attempting to kill them...
> Parent process 5032 exited with children alive:
> PIDS: 3644
> Attempting to kill them...
> Error Code 6 trying to query IO Completion Port, exiting
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File "C:\mozmill-env\2.0-windows\mozmill-env\python\Lib\threading.py", line 551, in __bootstrap_inner
>     self.run()
>   File "C:\mozmill-env\2.0-windows\mozmill-env\python\Lib\threading.py", line 504, in run
>     self.__target(*self.__args, **self.__kwargs)
>   File "c:\mozmill-env\2.0-windows\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 282, in _procmgr
>     self._poll_iocompletion_port()
>   File "c:\mozmill-env\2.0-windows\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 332, in _poll_iocompletion_port
>     raise WinError(errcode)
> WindowsError: [Error 6] The handle is invalid.

If we let test2 in a different file we don't see this error, we fail silently and the console only logs:
> IO Completion Port unexpectedly closed

For me this has a failure rate of about 10% on Windows 8

Attachment #814073 - Attachment is obsolete: true

Andrei Eftimie

Comment 20

•

11 years ago

I'm also confirming bug 874502 as a regressor.
100+ successful runs with the tinderbox build pre bug 874502 landing on inbound.

Depends on: 874502

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 21

•

11 years ago

Hm, it could be why I see so many js errors in the console with 1.5:

00:03:17.746 [Exception... "'[JavaScript Error: "aWindow.gBrowser is null" {file: "resource://mozmill/modules/init.js" line: 76}]' when calling method: [nsIDOMEventListener::handleEvent]"  nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: yes]

Mozmill 2.0 might handle it differently and fail.

Andrei Eftimie

Comment 22

•

11 years ago

I've been able to see this on OSX as well (no error message here though):
> OpenGL version detected: 210
> OpenGL version detected: 210
> TEST-START | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-PASS | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-END | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | finished in 3090ms
> OpenGL version detected: 210
> OpenGL version detected: 210
> TEST-START | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test2
> TEST-PASS | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test2
> TEST-END | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | finished in 1ms
> RESULTS | Passed: 2
> RESULTS | Failed: 0
> RESULTS | Skipped: 0
> OpenGL version detected: 210
> OpenGL version detected: 210
> TEST-START | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-PASS | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | test1
> TEST-END | /Users/andrei.eftimie/work/mozilla/bugs/922995/testcase/testcase.js | finished in 4755ms
> OpenGL version detected: 210
> OpenGL version detected: 210
> RESULTS | Passed: 1
> RESULTS | Failed: 0
> RESULTS | Skipped: 0

A 1% failure rate on OSX

Andrei Eftimie

Comment 23

•

11 years ago

Attached file debug-fail.txt — Details

--debug does not help us, attaching the log just for reference.

The only difference between a successful and a failed run is the failure as mentioned in comment 19 (+timestamps and temporary paths, but these are expected)

Andrei Eftimie

Comment 24

•

11 years ago

It looks like we are trying to close the socket while it still has some active connections.

Looking into a fix, probably wait for all socket connections to close before trying to close the it)

Andrei Eftimie

Comment 25

•

11 years ago

Henrik, Dave if any of you have experience with how we treat processes in mozbase and how the socket communication works with NSS/NSPR, I would really welcome some tips.

And a more poignant question, see how the processes are handled in mozbase:
> $ mozmill -t testcase/testcase.js -b /c/Mozilla/Nightly/en-US/firefox/firefox.exe
> DBG::MOZPROC Self.pid value is: 4404
> DBG::MOZPROC new process detected with pid value: 5160
> DBG::MOZPROC process id 5160 exited normally
> TEST-START | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test1
> TEST-PASS | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test1
> TEST-END | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | finished in 3873ms
> DBG::MOZPROC new process detected with pid value: 4784
> DBG::MOZPROC process id 4404 exited normally
> TEST-START | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test2
> TEST-PASS | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | test2
> TEST-END | c:\work\bugs\922995\mozmill-tests\testcase\testcase.js | finished in 2ms
> DBG::MOZPROC process id 4784 exited normally
> DBG::MOZPROC job object msg active processes zero
> RESULTS | Passed: 2
> RESULTS | Failed: 0
> RESULTS | Skipped: 0

This is for a passing example, when we fail we never close the 2nd child process.

Now, is it normal that we:
1) open the MAIN process
2) open child_1 process
3) close child_1 process
4) open child_2 process
5) close the MAIN process
6) close child_2 process

I don't have a clear grasp on how this should work, but shouldn't all child processes be killed before we kill the main process?

When this fails, we never see 6) child_2 is not closed.
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L315
portstatus is always 0, and we timeout after 180 seconds here:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L325

Any help/ideas would be appreciated.
Thanks

Flags: needinfo?(hskupin)

Flags: needinfo?(dave.hunt)

Andrei Eftimie

Comment 26

•

11 years ago

I've set up mm-win-8-64-3 to be able to test this failure.
It reproduces there at about the same rate I've seen locally 10-20%

All is setup in /c/922995/
I've also setup mozmill and mozbase, you will probably need to clear other versions of used packages in C:\922995\mozmill-env\python\Lib\site-packages

There is a terminal open that shows the latest failure message.

Andrei Eftimie

Comment 27

•

11 years ago

Attached patch 0001-Bug-922995-Added-a-sleep-in-restartApplication-to-le.patch (obsolete) — Details — Splinter Review

This is a hack since I still don't know exactly what or when to correctly wait for.

But if we let enough time for the broker to send the shutdown message trough the socket we don't fail anymore.

I have had 100 testcase runs without failure with this 100ms sleep.
The time is arbitrary.

We still fail with a sleep of 0ms, so just breaking the thread does not help.

I'm not sure: should we have some sort of callback that the message has been received? Just a confirmation that the message has been received. But that will have to travel through the socket again...

I'm running full testruns now to see if this still fails. Maybe we should be waiting more than a 100ms. 

I don't think stopApplication() needs this sleep, the testcase doesn't handle that, but it should be visible once I run more full testruns.

Attachment #816525 - Flags: feedback?(hskupin)

Attachment #816525 - Flags: feedback?(dave.hunt)

Flags: needinfo?(hskupin)

Flags: needinfo?(dave.hunt)

Andrei Eftimie

Comment 28

•

11 years ago

We also need the sleep for stopApplication()
We failed to run test1 here:
http://mozmill-crowd.blargon7.com/#/functional/report/ec449c026814fd64783d738c025fe34e

Andrei Eftimie

Comment 29

•

11 years ago

Attached patch patch_v2 (obsolete) — Details — Splinter Review

Added a sleep also for stopApplication().

Running 100 testruns on windows, they might finish by tomorrow, reporting on mozmill-crowd. So far 3 testruns without a problem.

Attachment #816525 - Attachment is obsolete: true

Attachment #816525 - Flags: feedback?(hskupin)

Attachment #816525 - Flags: feedback?(dave.hunt)

Attachment #816567 - Flags: feedback?(hskupin)

Attachment #816567 - Flags: feedback?(dave.hunt)

Andrei Eftimie

Comment 30

•

11 years ago

Just to confirm: 101 successful functional testruns!

Andrei Eftimie

Comment 31

•

11 years ago

Here is a crazy idea. We've talked about it in the office, and it might make sense.

What if the recent WSAEINVAL failures are related to this issue?
WSAEINVAL on windows is a Socket communication problem.
In mozmill 1.5 we don't forcefully close Firefox so we're not encountering the same Parent / Child process issues we've seen with 2.0

Here is (a very crude and not very accurate) plot of the failures by date:
> 2013-02-25 :: 1
> 2013-03-29 :: 1
> 2013-09-06 :: 1
> 2013-09-18 :: 1
> 2013-09-25 :: 1
> 2013-09-27 :: 2
> 2013-09-30 :: 1
> 2013-10-01 :: 2
> 2013-10-11 :: 2
> 2013-10-14 :: 2
> 2013-10-15 :: 1
> 2013-10-18 :: 2
> 2013-10-19 :: multiple failures (didn't retrigger over the weekend failures
> 2013-10-20 :: due to full queue after merge)
> 2013-10-21 :: 1
> 2013-10-22 :: 3
> 2013-10-23 :: 4

These are sampled from failed build emails containing "WSAEINVAL".
Notice how they pick up right after 18 Sept - that's when our regressor landed.

I'm not willing to bet money on this (well, maybe 5€).
But they might be related.

If this is indeed the case, we should push to have this fixed and use mozmill 2.0 in production. We've seen the WSAEINVAL failure 4-5 times per day lately.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 32

•

11 years ago

For which versions of Firefox are those numbers?

Andrei Eftimie

Comment 33

•

11 years ago

> For which versions of Firefox are those numbers?

Valid point. Just checked and you are right. Versions are all over the place.
Discard comment 31. There can't be any correlation there.

Dave Hunt [:davehunt] [he/him] ⌚BST

Comment 34

•

11 years ago

Comment on attachment 816567 [details] [diff] [review]
patch_v2

Review of attachment 816567 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to defer to Henrik on this patch, as he's much more familiar with the Mozmill internals than me.

Attachment #816567 - Flags: feedback?(dave.hunt)

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 35

•

11 years ago

Attached patch patch v3.0 (obsolete) — Details — Splinter Review

Hi, I had a look on this bug yesterday, and from what I saw that we are doing differently from 1.5, is that we are not closing the application with force-quit flag, and we don't check if we had granted the access to close the application by some others(addons/modals). So I added an observer to watch if we can close the application, and if we can't I set the flag to force-quit. I tested this on all machines to be sure I don't broke anything.
Also Andrei runned tests on the machine he reproduced this failure before from the morning and we have't seen any more this failure.

Assignee: andrei.eftimie → cosmin.malutan

Attachment #816567 - Attachment is obsolete: true

Attachment #816567 - Flags: feedback?(hskupin)

Attachment #821662 - Flags: review?(hskupin)

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 36

•

11 years ago

Here are some reports:
http://mozmill-crowd.blargon7.com/#/functional/report/95c9c51e8c2d9f14917fc749dd3f4022
http://mozmill-crowd.blargon7.com/#/functional/report/95c9c51e8c2d9f14917fc749dd2e7161
http://mozmill-crowd.blargon7.com/#/functional/report/95c9c51e8c2d9f14917fc749dd35ace6

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 37

•

11 years ago

Comment on attachment 821662 [details] [diff] [review]
patch v3.0

Review of attachment 821662 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozmill/mozmill/extension/resource/modules/frame.js
@@ +80,5 @@
> +    });
> +  } catch (e) {
> +    // Close the application forcibly, if we had not been granted access to close it
> +    // by some other subprocesses so they won't hang
> +    flags |= Ci.nsIAppStartup.eForceQuit;

Oh!! That's pretty interesting. Where has this flag been gone? Probably I missed to add this in one of my larger frame.js patches. :/ Very good find!

So what we should do here is to always have the eForceQuit flag set. We can never trust any other module or add-on, which can interfere with us. So triggering a close should really close Firefox.

So please add this to the default flags at the beginning of this method. All the other stuff (observer and waitFor) should not be necessary, right?

Attachment #821662 - Flags: review?(hskupin) → review-

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 38

•

11 years ago

Attached patch patch_v3.0 — Details — Splinter Review

(In reply to Henrik Skupin (:whimboo) from comment #37)
> So please add this to the default flags at the beginning of this method. All
> the other stuff (observer and waitFor) should not be necessary, right?
Right.

Attachment #821662 - Attachment is obsolete: true

Attachment #821710 - Flags: review?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 39

•

11 years ago

Given that, I first thought we regressed in Mozmill via my patch on bug 771517. But goQuitApplication is also using eAttemptQuit. So not sure what actually caused this.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 40

•

11 years ago

Comment on attachment 821710 [details] [diff] [review]
patch_v3.0

Review of attachment 821710 [details] [diff] [review]:
-----------------------------------------------------------------

That looks reasonable. Have you run all mutt tests across platforms beside our Mozmill tests? If that's the case you get my r+

Attachment #821710 - Flags: review?(hskupin) → review+

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 41

•

11 years ago

Yes, I runned mutt tests on all platforms.

(In reply to Henrik Skupin (:whimboo) from comment #39)
> Given that, I first thought we regressed in Mozmill via my patch on bug
> 771517. But goQuitApplication is also using eAttemptQuit. So not sure what
> actually caused this.
I searched for goQuitApplication on mxr, but after I made a dump on console I see that it's also uses eAttemptQuit, but it also checks if it can close the application (canQuitApplication()). If canQuitApplication returns false, we won't close the application, and we would try to close the socket connection (self.bridge) while browser is opened, I think this is causing the WSAEINVAL error, bug 915563. I would like to take the machine offline after it starts failing to print some dumps to be sure.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 42

•

11 years ago

Yeah, that might be. But I'm not totally convinced of that, given that the WSEINVAL issue is existent for a while now. 

Cosmin, what about the testcase attached to that bug? Is it one we could make use of by a Mutt Python test? I would think so. I would feel more confident if it would be included.

Andrei Eftimie

Comment 43

•

11 years ago

+1 for including the testcase as a mutt test

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 44

•

11 years ago

Cosmin mentioned to me on IRC that for the test it would need a modal dialog during shutdown. What about the clear history one getting opened and not closed? I think that should work. Also all this sounds somewhat related to another bug we had for mozmill but we already thought it was fixed. Strange.

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

11 years ago

Component: Mozmill Tests → Mozmill

Product: Mozilla QA → Testing

Whiteboard: [mozmill-test-failure] → [mozmill-2.0.1+]

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 45

•

11 years ago

Hi Henrik, you're right that fails to, but I guess that clear history it works like print page(system modal), it prevents the browser from closing, I need to be able to close the browser and the modal to remain open (because it's unresponding). But trying to close the browser with the print/ clear history open also results in "Timeout: bridge.execFunction" / "Disconnect Error: Application unexpectedly closed"

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 46

•

11 years ago

I think we could open the modal dialog asynchronously via setTimeout() and a delay of e.g. 50ms. At the same time we are closing the browser. It would be worth a try. If it doesn't help we might want to take the test as it is even if it's not 100% reproducible before the fix.

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 47

•

11 years ago

Attached patch mutt_test (obsolete) — Details — Splinter Review

(In reply to Henrik Skupin (:whimboo) from comment #46)
> I think we could open the modal dialog asynchronously via setTimeout() and a
> delay of e.g. 50ms. At the same time we are closing the browser. It would be
> worth a try. 
That was brilliant, if I open the modal asynchronously and then call restartApplication asynchronously the modal it hangs for as long the callback runs. Unfortunately I still don't get that failure, but I did't ran the tests on a windows machine yet.
>If it doesn't help we might want to take the test as it is even
> if it's not 100% reproducible before the fix.

So I made a mutt test in python, in which I run two mozmill tests, and I check that both tests passes.
In one test I open the AddonsManager, and in one clear history.

Attachment #822590 - Flags: feedback?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 48

•

11 years ago

Comment on attachment 822590 [details] [diff] [review]
mutt_test

Review of attachment 822590 [details] [diff] [review]:
-----------------------------------------------------------------

Please see my inline comments. Otherwise please always combine patches and don't upload separate ones for the core fix and tests. Thanks.

::: mutt/mutt/tests/python/js-modules/testCloseApplication/testQuitApplicationWithAMOpen.js
@@ +20,5 @@
> +    }, "Selected category has been loaded", 30000);
> +  }
> +  finally {
> +    controller.window.document.removeEventListener("ViewChanged", onViewLoaded, false);
> +  }

What's the condition of this test to show the problem? As you said a modal dialog should be necessary. Here we don't have anything like that. I'm not certain that this test buys us that much.

::: mutt/mutt/tests/python/js-modules/testCloseApplication/testQuitApplicationWithHistoryOpen.js
@@ +14,5 @@
> +    controller.sleep(1000);
> +  }, 100);
> +
> +  controller.window.setTimeout(function () {
> +    controller.restartApplication("testTwo");

In our original tests we haven't had all the tests in a single file. So I wonder if that makes a difference in terms that you weren't able to see this problem.

@@ +18,5 @@
> +    controller.restartApplication("testTwo");
> +  }, 500);
> +}
> +
> +var testTwo = function () {}

As you can remember we had the problem that we didn't successfully run the first couple of tests in a sequence. So you might want to replicate that here and keep the status in the persisted object. An additional check at the end should make it safer for the js side.

Attachment #822590 - Flags: feedback?(hskupin) → feedback+

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 49

•

11 years ago

Hi Henrik, before I will update the patch I want to be sure I approach it correctly.

(In reply to Henrik Skupin (:whimboo) from comment #48)
> testQuitApplicationWithAMOpen.js
> @@ +20,5 @@
> > +    }, "Selected category has been loaded", 30000);
> > +  }
> > +  finally {
> > +    controller.window.document.removeEventListener("ViewChanged", onViewLoaded, false);
> > +  }
> 
> What's the condition of this test to show the problem? As you said a modal
> dialog should be necessary. Here we don't have anything like that. I'm not
> certain that this test buys us that much.
This is the testcase where it failed, if the Add-ons Manager freezes or doesn't closes for some reason we get:
>> Parent process 5032 exited with children alive:
>> PIDS: 3644
>> Attempting to kill them...
>> Parent process 5032 exited with children alive:
>> PIDS: 3644
>> Attempting to kill them...
This failure I tried to reproduce in mutt test. And it seems to be fixed if in method frame/shutdownApplication(), we close the browser with eForceQuit flag. So I would like to keep this testcase too, what do you say ?
  
> :::
> mutt/mutt/tests/python/js-modules/testCloseApplication/
> testQuitApplicationWithHistoryOpen.js
> @@ +14,5 @@
> > +    controller.sleep(1000);
> > +  }, 100);
> > +
> > +  controller.window.setTimeout(function () {
> > +    controller.restartApplication("testTwo");
> 
> In our original tests we haven't had all the tests in a single file. So I
> wonder if that makes a difference in terms that you weren't able to see this
> problem.
I testcase Andrei made they are on the same file, but I run testes from two different files so I would say that I treat both cases.
Also in python code I check that I had 2 successful tests so if one test it's missed the python assertion should fail.  
> @@ +18,5 @@
> > +    controller.restartApplication("testTwo");
> > +  }, 500);
> > +}
> > +
> > +var testTwo = function () {}
> 
> As you can remember we had the problem that we didn't successfully run the
> first couple of tests in a sequence. So you might want to replicate that
> here and keep the status in the persisted object. An additional check at the
> end should make it safer for the js side.
That sounds great, I will keep tracking in persisted object and check that all tests had been ran, and that they've been ran in the right order.

Again, my question is if I can keep the first test too, since is the on in which we reproduced the failure, and also I will have tests from two different files too ?

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 50

•

11 years ago

(In reply to Cosmin Malutan from comment #49)
> > What's the condition of this test to show the problem? As you said a modal
> > dialog should be necessary. Here we don't have anything like that. I'm not
> > certain that this test buys us that much.
>
> This is the testcase where it failed, if the Add-ons Manager freezes or
> doesn't closes for some reason we get:

I don't see how we could force this to happen. And if that happens it's clearly a bug in Firefox. The UI should never freeze. So if it is reproducible it's another issue to get fixed. I would not include this test.

> >> Parent process 5032 exited with children alive:
> >> PIDS: 3644
> >> Attempting to kill them...
> >> Parent process 5032 exited with children alive:
> >> PIDS: 3644
> >> Attempting to kill them...
> This failure I tried to reproduce in mutt test. And it seems to be fixed if
> in method frame/shutdownApplication(), we close the browser with eForceQuit
> flag. So I would like to keep this testcase too, what do you say ?

I would suggest to analyze the problem of the freeze separately and not patch around it. Mozmill should be able to close Firefox in such a case, that's true. But let us not add this case which is mostly intermittent.

> > In our original tests we haven't had all the tests in a single file. So I
> > wonder if that makes a difference in terms that you weren't able to see this
> > problem.
>
> I testcase Andrei made they are on the same file, but I run testes from two
> different files so I would say that I treat both cases.

As long as we support both cases it might be a good idea to also test them both. Thanks.

> > As you can remember we had the problem that we didn't successfully run the
> > first couple of tests in a sequence. So you might want to replicate that
> > here and keep the status in the persisted object. An additional check at the
> > end should make it safer for the js side.
>
> That sounds great, I will keep tracking in persisted object and check that
> all tests had been ran, and that they've been ran in the right order.

Sounds good.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

11 years ago

Blocks: 931828

Andrei Eftimie

Comment 51

•

11 years ago

> I would suggest to analyze the problem of the freeze separately and not patch around it.

Just to add my 2 cents here.
In the testcase attached to this bug nothing really freezes.

We wait for 180 seconds to timeout after we try to close the application but fail to close the child process (which is the socket communication) because it was still running at the time of the quit.

The 3 minutes are set here:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L56

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 52

•

11 years ago

180s for timing out what? This is still not clear to me. And when you are sure there is no freeze, what is it? The application doesn't seem to respond, right? Also who wrote the pointed out mozprocess code? We should bring this person into this discussion. Not sure why this hasn't happened earlier here, if it's really an issue with mozprocess.

Andrei Eftimie

Comment 53

•

11 years ago

I am not sure I understand your question.
Here is the actual failure:
https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/processhandler.py#L330

See the comment above it:
> # Arbitrarily wait 3 minutes for windows to get its act together
> # Windows sometimes takes a small nap between notifying the
> # IO Completion port and actually killing the children, and we
> # don't want to mistake that situation for the situation of an unexpected
> # parent abort (which is what we're looking for here).

> Not sure why this hasn't happened earlier here, if it's really an issue with mozprocess.
I'm not sure the issue is with mozprocess, we fail to correctly shut down the app given that we still have a socket active.

The specific code has been inserted by Clint:
https://github.com/mozilla/mozbase/commit/e5165109f69a80b2720a6dc1bb54af1fa05f1a9c

Clint, do you still remember the above referenced code? What it should do and in what circumstances?

Flags: needinfo?(ctalbert)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 54

•

11 years ago

(In reply to Andrei Eftimie from comment #53)
> I'm not sure the issue is with mozprocess, we fail to correctly shut down
> the app given that we still have a socket active.

And here is the problem which you haven't really answered yet. What is preventing us from closing Firefox with the add-on manager open? There is no modal dialog, and as you said no freeze. But what else is blocking us? If you can reproduce (which I cannot) please find that out.

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 55

•

11 years ago

Attached patch mutt_test — Details — Splinter Review

I still don't understand why opening the add-ons manager prevents us from closing the runner.

I updated the mutt test, and what it does now is:
1- Set's persisted object and open addons manager - next test is in different file 
2- Checks that test1 was ran before and  close addons manager - next test is in the same file 
3- Checks that test2 was ran before  and open addons manager - next test is in the same file 
4- Checks that test3 was ran before and that test4 should be ran now.

On python side it also checks that all four test have been ran and that the last one was test4.

The mutt test failed once in a 100 runs without the fix, and it did't fail in 500 with the fix applied.

Attachment #822590 - Attachment is obsolete: true

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 56

•

11 years ago

I really don't want to hold of this bug for being fixed anytime longer. If there is no easy way to get this reproduced, I would suggest we land the fix and work on a specific test in a follow-up bug. It already takes a bit too long here.

What has to be done to wrap this bug up?

Cosmin Malutan, [:cosmin-malutan]

Assignee

Comment 57

•

11 years ago

Agreed, considering that fix is landed on master, I think we need to transplant the fix on hotfix-2.0 and raise a new follow-up bug for test.

Fix landed 6 days ago on master:
https://github.com/mozilla/mozmill/commit/e30fc0bad248792379a05a7714cabd5ac69e4aac

status-firefox26: affected → fixed

status-firefox27: affected → fixed

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 58

•

11 years ago

Oh, that push was not intended. Seems like I pushed this accidentally with the most recent one. My bad. But given that it has been landed, lets get it cherry-picked to hotfix-2.0, yes:

https://github.com/mozilla/mozmill/commit/e712dcd09fa03838c6ad8ee40f9234148bca42c9

Cosmin, please file the new bug for the test and mark as dependency here. Thanks.

Severity: normal → critical

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Flags: needinfo?(ctalbert)

Resolution: --- → FIXED

Cosmin Malutan, [:cosmin-malutan]

Assignee

Updated

•

11 years ago

Blocks: 933132

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

10 years ago

Depends on: 966235

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Testing → Testing Graveyard

testcase 1 11 years ago Andrei Eftimie 2.60 KB, patch		Details \| Diff \| Splinter Review
testcase.js 11 years ago Andrei Eftimie 866 bytes, text/javascript		Details
debug-fail.txt 11 years ago Andrei Eftimie 30.04 KB, text/plain		Details
0001-Bug-922995-Added-a-sleep-in-restartApplication-to-le.patch 11 years ago Andrei Eftimie 1.33 KB, patch		Details \| Diff \| Splinter Review
patch_v2 11 years ago Andrei Eftimie 1.99 KB, patch		Details \| Diff \| Splinter Review
patch v3.0 11 years ago Cosmin Malutan, [:cosmin-malutan] 1.92 KB, patch	whimboo : review-	Details \| Diff \| Splinter Review
patch_v3.0 11 years ago Cosmin Malutan, [:cosmin-malutan] 924 bytes, patch	whimboo : review+	Details \| Diff \| Splinter Review
mutt_test 11 years ago Cosmin Malutan, [:cosmin-malutan] 5.54 KB, patch	whimboo : feedback+	Details \| Diff \| Splinter Review
mutt_test 11 years ago Cosmin Malutan, [:cosmin-malutan] 7.75 KB, patch		Details \| Diff \| Splinter Review