Closed Bug 1290186 Opened 8 years ago Closed 8 years ago

Intermittent test_end for test_windows.py TestBaseWindow.test_open_close logged while not in progress. Logged with data: {"status": "ERROR", "extra": {"class_name": "test_windows.TestBaseWindow", "method_name": "test_open_close"}

Categories

(Testing :: Firefox UI Tests, defect)

Version 3
defect
Not set
normal

Tracking

(firefox50 fixed, firefox51 fixed)

RESOLVED FIXED
mozilla51
Tracking Status
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: mconley)

References

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

I ran the test 20 times locally but no luck to reproduce.

I saw WriteMinidump() in the stack and thought about bug 1267329, but it doesn't seem the reason of the crash.
04:29:24 INFO - ###!!! [Parent][MessageChannel] Error: (msgtype=0x2000E,name=PAPZ::Msg_Destroy) Channel error: cannot send/recv 

I am not sure, but does this mean somehow the message is sent too late?
Flags: needinfo?(bugmail)
(In reply to Ting-Yu Chou [:ting] from comment #2)
> I saw WriteMinidump() in the stack and thought about bug 1267329, but it
> doesn't seem the reason of the crash.

See bug 1289243 for the originally filed bug but which now gets hidden on Treeherder.

When I try to run our tests with Ubuntu 16.04 its a permanent failure as it looks like:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d007a1512b5&selectedJob=24777458

Maybe I should grab a one click loaner and check it.
(In reply to Ting-Yu Chou [:ting] from comment #3)
> 04:29:24 INFO - ###!!! [Parent][MessageChannel] Error:
> (msgtype=0x2000E,name=PAPZ::Msg_Destroy) Channel error: cannot send/recv 
> 
> I am not sure, but does this mean somehow the message is sent too late?

Yeah, it means the message is sent after the channel is torn down. However this could happen for any number of reasons, like if the child process crashes then the parent process might still be in the process of sending it messages which would fail with this output.
Flags: needinfo?(bugmail)
Mike, I checked Treeherder for this failure and indeed the first time it appears is for the landing of your patch! See this link:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&fromchange=84b5a1027550&tochange=bda9fd505371&selectedJob=913100

Sadly there were infra issues for this push. So I just retriggered some more tasks to check if we can reproduce.
I checked this locally and I see dozen of crashes which happen each time a new chrome window gets opened. All of them have an assertion occurring first:

> [Parent 1906] WARNING: NS_ENSURE_TRUE(document) failed: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2004
> [Parent 1906] WARNING: NS_ENSURE_SUCCESS(rv, false) failed with result 0x80004002: file /home/worker/workspace/build/src/dom/base/nsDocument.cpp, line 5947
>  [Parent 1906] ###!!! ASSERTION: Why is there a document channel?: '!chan', file /home/worker/workspace/build/src/embedding/components/windowwatcher/nsWindowWatcher.cpp, line 2187

I think we have to file a bug under cookies to get this fixed, and mark all test failures dependent on it.
To correct myself the assertion is for the window watcher, and not the cookie file.
Bug 1290280 might be what we see here for our tests? It says virtualization. Maybe given that our tests are run in AWS there is something related to that? At least because we cannot reproduce on our local machines. Mike, what do you think?
Flags: needinfo?(mconley)
Currently looking into it.
Flags: needinfo?(mconley)
So I looked into this a bit, and I think I know what's going on.

test_windows.py opens and closes a bunch of windows, while a background window pointed at about:newtab just kinda hangs around.

Because bug 1261842 made initial browsers remote, this means that we're creating and destroying remote browsers pretty rapidly, which means we're spawning content processes, and then sending messages to them to shut down pretty rapidly.

In ContentParent, when we tell the content process to shut down, we also start a timer to kill that content process manually if it doesn't close itself down in time. By default, that timer is 5 seconds long.

So we're opening and closing a bunch of content processes, and we're also scheduling these things to be terminated if they don't shut down in time. In at least one case, we open and close a window so quickly that the shut down timer starts before the content process has done any of its own initialization. The parent is busy during this time, and the content process usually asks the parent for information during initialization (synchronously). This means that the content process takes longer to initialize, which means that it takes longer for it to pop the "shut down" message off the queue, which means that it's far more likely that we'll timeout and the parent will kill the content process manually.

When the parent kills the content process, it creates a paired minidump. A stack is generated for the content (which is mostly useless since the content could have been doing anything at the time of the kill) and one is generated for the parent (which shows us that we're doing a shut down kill of the content process).
The solution that whimboo and I have come up with is to set dom.ipc.tabs.shutdownTimeoutSecs to 0 for the test. This causes us to skip the content process shut down timer logic, which should get us past the orange here.

We might want to find a longer-term solution here, but this should do for now.
I filed bug 1293270 to get some e10s related improvements into mozcrash.
Assignee: nobody → mconley
No longer blocks: 1290192
No longer depends on: 1290280
Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67186

<p>The changes are looking great. If try is also happy with a couple of retriggeres included for e10s-local, we can get this patch landed. Thank you Mike!</p>

::: testing/firefox-ui/tests/puppeteer/test_windows.py:29
(Diff revision 1)
> +        the content process manually, which generates a crash report,
> +        which causes these tests to orange. We side-step this by
> +        setting dom.ipc.tabs.shutdownTimeoutSecs to 0, which disables
> +        the shutdown timer.
> +        """
> +        self.prefs.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)

Great explanation! It will ensure that no-one will play with it.
Attachment #8778917 - Flags: review?(hskupin) → review+
Hrm, my try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc7a20b91544ef850d7a70d95a24491f8d5d0588 shows us still oranging here.

The stack is still the same, which suggests to me that we're still somehow killing the content process off of the 5 second timer.

whimboo - am I somehow setting the pref incorrectly?
Flags: needinfo?(hskupin)
According to https://public-artifacts.taskcluster.net/R19qUIvtQimZeAjpYFxp7A/0/public/logs/live_backing.log (I added some manual logging - search for "mconley") even with this patch, we're still reading dom.ipc.tabs.shutdownTimeoutSecs as 5.

Which suggests to me that this patch is just not properly setting the pref. Any idea what I'm doing wrong?
Could it be that this preference needs a restart of the application? It would be the only possible idea I currently have.
Flags: needinfo?(hskupin)
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8f2224b00d74
Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests. r=whimboo
Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67544

::: testing/firefox-ui/tests/puppeteer/test_windows.py:30
(Diff revisions 1 - 2)
>          which causes these tests to orange. We side-step this by
>          setting dom.ipc.tabs.shutdownTimeoutSecs to 0, which disables
>          the shutdown timer.
>          """
> -        self.prefs.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)
> +        FirefoxTestCase.setUp(self)
> +        self.marionette.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)

Can you please explain the change here to use `self.marionette.set_pref()`? It has different behavior and the modified pref is not being reset at the end of the test! So we would need a custom tearDown method for BaseWindowTestCase.
I requested a backout of the patch from autoland due to the above mentioned issue. We decided not to affect other tests than the test_windows one.
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/643a67968688
Backed out changeset 8f2224b00d74 on request from whimboo
(In reply to Henrik Skupin (:whimboo) from comment #30)
> Can you please explain the change here to use `self.marionette.set_pref()`?
> It has different behavior and the modified pref is not being reset at the
> end of the test! So we would need a custom tearDown method for
> BaseWindowTestCase.

Oh, yikes! I apologize - I shouldn't have been so cavalier. I assumed that self.marionette.set_pref was an alias for self.prefs.set_pref. I didn't realize their behaviour was different.

Out of curiosity, how does self.marionette.set_pref differ from self.marionette.enforce_gecko_prefs?

I'll switch this back to self.prefs.set_pref and re-request review.
Attachment #8778917 - Flags: review+ → review?(hskupin)
(In reply to Mike Conley (:mconley) - (Needinfo me!) from comment #33)
> Oh, yikes! I apologize - I shouldn't have been so cavalier. I assumed that
> self.marionette.set_pref was an alias for self.prefs.set_pref. I didn't
> realize their behaviour was different.

Yeah, puppeteer has its own prefs methods at the moment. I filed a mentored project to get rid of them. See bug 1293588.

> Out of curiosity, how does self.marionette.set_pref differ from
> self.marionette.enforce_gecko_prefs?

set_pref() will not survive a restart while with enforce_gecko_prefs they do.
Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67660

::: testing/firefox-ui/tests/puppeteer/test_windows.py:30
(Diff revision 3)
> +        which causes these tests to orange. We side-step this by
> +        setting dom.ipc.tabs.shutdownTimeoutSecs to 0, which disables
> +        the shutdown timer.
> +        """
> +        FirefoxTestCase.setUp(self)
> +        self.prefs.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)

I can remember you mentioned problems that the new pref value is not accounted and we still crash. Were you able to figure it out why it happened? I miss an appropriate comment.
Attachment #8778917 - Flags: review?(hskupin) → review+
Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67660

> I can remember you mentioned problems that the new pref value is not accounted and we still crash. Were you able to figure it out why it happened? I miss an appropriate comment.

Ah, yeah, the reason was because I had erroneously named the method `setup` instead of `setUp`, so it wasn't being called correctly. I also needed to call the parent class `setUp`, otherwise `self.prefs` is not defined.
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d4b23db9ec3a
Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests. r=whimboo
Ah, right. I now remember this change! Thanks for the additional information.
https://hg.mozilla.org/mozilla-central/rev/d4b23db9ec3a
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: