1290186 - Intermittent test_end for test_windows.py TestBaseWindow.test_open_close logged while not in progress. Logged with data: {"status": "ERROR", "extra": {"class_name": "test_windows.TestBaseWindow", "method_name": "test_open_close"}

Reporter

Description

•

8 years ago

treeherder

Filed by: wkocher [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=32810288&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/cg51UQ_0S1ucih0UAlSh0Q/runs/0/artifacts/public%2Flogs%2Flive_backing.log

Comment hidden (Intermittent Failures Robot)

24 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 13
* autoland: 7
* fx-team: 4

Platform breakdown:
* linux64: 24

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-07-28&endday=2016-07-28&tree=all

Ting-Yu Chou [:ting] (away)

Comment 2

•

8 years ago

I ran the test 20 times locally but no luck to reproduce.

I saw WriteMinidump() in the stack and thought about bug 1267329, but it doesn't seem the reason of the crash.

Ting-Yu Chou [:ting] (away)

Comment 3

•

8 years ago

04:29:24 INFO - ###!!! [Parent][MessageChannel] Error: (msgtype=0x2000E,name=PAPZ::Msg_Destroy) Channel error: cannot send/recv 

I am not sure, but does this mean somehow the message is sent too late?

Flags: needinfo?(bugmail)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 4

•

8 years ago

(In reply to Ting-Yu Chou [:ting] from comment #2)
> I saw WriteMinidump() in the stack and thought about bug 1267329, but it
> doesn't seem the reason of the crash.

See bug 1289243 for the originally filed bug but which now gets hidden on Treeherder.

When I try to run our tests with Ubuntu 16.04 its a permanent failure as it looks like:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d007a1512b5&selectedJob=24777458

Maybe I should grab a one click loaner and check it.

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

8 years ago

Blocks: 1290192

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

8 years ago

Comment 5

•

8 years ago

(In reply to Ting-Yu Chou [:ting] from comment #3)
> 04:29:24 INFO - ###!!! [Parent][MessageChannel] Error:
> (msgtype=0x2000E,name=PAPZ::Msg_Destroy) Channel error: cannot send/recv 
> 
> I am not sure, but does this mean somehow the message is sent too late?

Yeah, it means the message is sent after the channel is torn down. However this could happen for any number of reasons, like if the child process crashes then the parent process might still be in the process of sending it messages which would fail with this output.

Flags: needinfo?(bugmail)

Comment hidden (Intermittent Failures Robot)

49 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 21
* autoland: 15
* fx-team: 8
* mozilla-central: 3
* try: 2

Platform breakdown:
* linux64: 48
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-07-29&endday=2016-07-29&tree=all

Comment hidden (Intermittent Failures Robot)

75 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 35
* autoland: 23
* fx-team: 13
* try: 2
* mozilla-central: 2

Platform breakdown:
* linux64: 74
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-07-25&endday=2016-07-31&tree=all

Comment hidden (Intermittent Failures Robot)

22 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 9
* mozilla-inbound: 7
* mozilla-central: 2
* mozilla-aurora: 2
* try: 1
* fx-team: 1

Platform breakdown:
* linux64: 22

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-08-01&endday=2016-08-01&tree=all

Comment hidden (Intermittent Failures Robot)

36 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 16
* autoland: 14
* fx-team: 5
* try: 1

Platform breakdown:
* linux64: 35
* android-4-3-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-08-02&endday=2016-08-02&tree=all

Comment hidden (Intermittent Failures Robot)

44 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 24
* autoland: 9
* fx-team: 7
* mozilla-central: 4

Platform breakdown:
* linux64: 44

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-08-03&endday=2016-08-03&tree=all

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

8 years ago

Blocks: 1261842

Keywords: regression

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 11

•

8 years ago

Mike, I checked Treeherder for this failure and indeed the first time it appears is for the landing of your patch! See this link:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&fromchange=84b5a1027550&tochange=bda9fd505371&selectedJob=913100

Sadly there were infra issues for this push. So I just retriggered some more tasks to check if we can reproduce.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 12

•

8 years ago

I checked this locally and I see dozen of crashes which happen each time a new chrome window gets opened. All of them have an assertion occurring first:

> [Parent 1906] WARNING: NS_ENSURE_TRUE(document) failed: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2004
> [Parent 1906] WARNING: NS_ENSURE_SUCCESS(rv, false) failed with result 0x80004002: file /home/worker/workspace/build/src/dom/base/nsDocument.cpp, line 5947
>  [Parent 1906] ###!!! ASSERTION: Why is there a document channel?: '!chan', file /home/worker/workspace/build/src/embedding/components/windowwatcher/nsWindowWatcher.cpp, line 2187

I think we have to file a bug under cookies to get this fixed, and mark all test failures dependent on it.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 13

•

8 years ago

To correct myself the assertion is for the window watcher, and not the cookie file.

Comment hidden (Intermittent Failures Robot)

33 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 15
* mozilla-inbound: 14
* fx-team: 4

Platform breakdown:
* linux64: 33

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-08-04&endday=2016-08-04&tree=all

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 15

•

8 years ago

Bug 1290280 might be what we see here for our tests? It says virtualization. Maybe given that our tests are run in AWS there is something related to that? At least because we cannot reproduce on our local machines. Mike, what do you think?

Flags: needinfo?(mconley)

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 16

•

8 years ago

Currently looking into it.

Flags: needinfo?(mconley)

Comment hidden (Intermittent Failures Robot)

174 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 76
* autoland: 55
* fx-team: 24
* mozilla-central: 9
* try: 6
* mozilla-aurora: 4

Platform breakdown:
* linux64: 173
* android-4-3-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1290186&startday=2016-08-01&endday=2016-08-07&tree=all

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

8 years ago

Depends on: 1290280

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 18

•

8 years ago

So I looked into this a bit, and I think I know what's going on.

test_windows.py opens and closes a bunch of windows, while a background window pointed at about:newtab just kinda hangs around.

Because bug 1261842 made initial browsers remote, this means that we're creating and destroying remote browsers pretty rapidly, which means we're spawning content processes, and then sending messages to them to shut down pretty rapidly.

In ContentParent, when we tell the content process to shut down, we also start a timer to kill that content process manually if it doesn't close itself down in time. By default, that timer is 5 seconds long.

So we're opening and closing a bunch of content processes, and we're also scheduling these things to be terminated if they don't shut down in time. In at least one case, we open and close a window so quickly that the shut down timer starts before the content process has done any of its own initialization. The parent is busy during this time, and the content process usually asks the parent for information during initialization (synchronously). This means that the content process takes longer to initialize, which means that it takes longer for it to pop the "shut down" message off the queue, which means that it's far more likely that we'll timeout and the parent will kill the content process manually.

When the parent kills the content process, it creates a paired minidump. A stack is generated for the content (which is mostly useless since the content could have been doing anything at the time of the kill) and one is generated for the parent (which shows us that we're doing a shut down kill of the content process).

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 19

•

8 years ago

The solution that whimboo and I have come up with is to set dom.ipc.tabs.shutdownTimeoutSecs to 0 for the test. This causes us to skip the content process shut down timer logic, which should get us past the orange here.

We might want to find a longer-term solution here, but this should do for now.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 20

•

8 years ago

I filed bug 1293270 to get some e10s related improvements into mozcrash.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/70026/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/70026/

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

8 years ago

Assignee: nobody → mconley

No longer blocks: 1290192

No longer depends on: 1290280

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 22

•

8 years ago

mozreview-review

Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67186

<p>The changes are looking great. If try is also happy with a couple of retriggeres included for e10s-local, we can get this patch landed. Thank you Mike!</p>

::: testing/firefox-ui/tests/puppeteer/test_windows.py:29
(Diff revision 1)
> +        the content process manually, which generates a crash report,
> +        which causes these tests to orange. We side-step this by
> +        setting dom.ipc.tabs.shutdownTimeoutSecs to 0, which disables
> +        the shutdown timer.
> +        """
> +        self.prefs.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)

Great explanation! It will ensure that no-one will play with it.

Attachment #8778917 - Flags: review?(hskupin) → review+

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 25

•

8 years ago

Hrm, my try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc7a20b91544ef850d7a70d95a24491f8d5d0588 shows us still oranging here.

The stack is still the same, which suggests to me that we're still somehow killing the content process off of the 5 second timer.

whimboo - am I somehow setting the pref incorrectly?

Flags: needinfo?(hskupin)

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 26

•

8 years ago

According to https://public-artifacts.taskcluster.net/R19qUIvtQimZeAjpYFxp7A/0/public/logs/live_backing.log (I added some manual logging - search for "mconley") even with this patch, we're still reading dom.ipc.tabs.shutdownTimeoutSecs as 5.

Which suggests to me that this patch is just not properly setting the pref. Any idea what I'm doing wrong?

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 27

•

8 years ago

Could it be that this preference needs a restart of the application? It would be the only possible idea I currently have.

Flags: needinfo?(hskupin)

Comment hidden (mozreview-request)

Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/70026/diff/1-2/

Pulsebot

Comment 29

•

8 years ago

Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8f2224b00d74
Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests. r=whimboo

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 30

•

8 years ago

mozreview-review

Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67544

::: testing/firefox-ui/tests/puppeteer/test_windows.py:30
(Diff revisions 1 - 2)
>          which causes these tests to orange. We side-step this by
>          setting dom.ipc.tabs.shutdownTimeoutSecs to 0, which disables
>          the shutdown timer.
>          """
> -        self.prefs.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)
> +        FirefoxTestCase.setUp(self)
> +        self.marionette.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)

Can you please explain the change here to use `self.marionette.set_pref()`? It has different behavior and the modified pref is not being reset at the end of the test! So we would need a custom tearDown method for BaseWindowTestCase.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 31

•

8 years ago

I requested a backout of the patch from autoland due to the above mentioned issue. We decided not to affect other tests than the test_windows one.

Pulsebot

Comment 32

•

8 years ago

Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/643a67968688
Backed out changeset 8f2224b00d74 on request from whimboo

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 33

•

8 years ago

(In reply to Henrik Skupin (:whimboo) from comment #30)
> Can you please explain the change here to use `self.marionette.set_pref()`?
> It has different behavior and the modified pref is not being reset at the
> end of the test! So we would need a custom tearDown method for
> BaseWindowTestCase.

Oh, yikes! I apologize - I shouldn't have been so cavalier. I assumed that self.marionette.set_pref was an alias for self.prefs.set_pref. I didn't realize their behaviour was different.

Out of curiosity, how does self.marionette.set_pref differ from self.marionette.enforce_gecko_prefs?

I'll switch this back to self.prefs.set_pref and re-request review.

Comment hidden (mozreview-request)

Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/70026/diff/2-3/

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

8 years ago

Attachment #8778917 - Flags: review+ → review?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 35

•

8 years ago

(In reply to Mike Conley (:mconley) - (Needinfo me!) from comment #33)
> Oh, yikes! I apologize - I shouldn't have been so cavalier. I assumed that
> self.marionette.set_pref was an alias for self.prefs.set_pref. I didn't
> realize their behaviour was different.

Yeah, puppeteer has its own prefs methods at the moment. I filed a mentored project to get rid of them. See bug 1293588.

> Out of curiosity, how does self.marionette.set_pref differ from
> self.marionette.enforce_gecko_prefs?

set_pref() will not survive a restart while with enforce_gecko_prefs they do.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 36

•

8 years ago

mozreview-review

Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67660

::: testing/firefox-ui/tests/puppeteer/test_windows.py:30
(Diff revision 3)
> +        which causes these tests to orange. We side-step this by
> +        setting dom.ipc.tabs.shutdownTimeoutSecs to 0, which disables
> +        the shutdown timer.
> +        """
> +        FirefoxTestCase.setUp(self)
> +        self.prefs.set_pref('dom.ipc.tabs.shutdownTimeoutSecs', 0)

I can remember you mentioned problems that the new pref value is not accounted and we still crash. Were you able to figure it out why it happened? I miss an appropriate comment.

Attachment #8778917 - Flags: review?(hskupin) → review+

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 37

•

8 years ago

mozreview-review-reply

Comment on attachment 8778917 [details]
Bug 1290186 - Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests.

https://reviewboard.mozilla.org/r/70026/#review67660

> I can remember you mentioned problems that the new pref value is not accounted and we still crash. Were you able to figure it out why it happened? I miss an appropriate comment.

Ah, yeah, the reason was because I had erroneously named the method `setup` instead of `setUp`, so it wasn't being called correctly. I also needed to call the parent class `setUp`, otherwise `self.prefs` is not defined.

Pulsebot

Comment 38

•

8 years ago

Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d4b23db9ec3a
Workaround content process shutdown kills in Firefox UI tests by disabling timeout for some tests. r=whimboo

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 39

•

8 years ago

Ah, right. I now remember this change! Thanks for the additional information.

Carsten Book [:Tomcat]

Comment 40

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/d4b23db9ec3a

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox51: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla51

Ryan VanderMeulen [:RyanVM]

Comment 41

•

8 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/26538451d272

status-firefox50: --- → fixed

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

7 years ago

Depends on: 1407949

Comment hidden (Intermittent Failures Robot)

1 failures in 623 pushes (0.002 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* try: 1

Platform breakdown:
* osx-10-10: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1290186&startday=2018-07-09&endday=2018-07-15&tree=trunk