Closed Bug 1303346 Opened 6 years ago Closed 5 years ago

[e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)

Categories

(Testing :: Firefox UI Tests, defect, P2)

Version 3
Unspecified
Linux
defect

Tracking

(e10s+, firefox50 unaffected, firefox51 wontfix, firefox52 wontfix, firefox-esr52 wontfix, firefox53 wontfix, firefox54 wontfix, firefox55 fixed)

RESOLVED FIXED
mozilla55
Tracking Status
e10s + ---
firefox50 --- unaffected
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed])

With the patch on bug 1301093 landed on autoland I see a new test failure popping up quite often now:


https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=85ec67dd11e5626a4cf0efc2abda9e3d44cb63b2&filter-searchStr=fxfn&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable

[task 2016-09-16T11:32:42.182647Z] 11:32:42     INFO - TEST-UNEXPECTED-ERROR | test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
[task 2016-09-16T11:32:42.185021Z] 11:32:42     INFO - Traceback (most recent call last):
[task 2016-09-16T11:32:42.185657Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test/testcases.py", line 162, in run
[task 2016-09-16T11:32:42.186575Z] 11:32:42     INFO -     testMethod()
[task 2016-09-16T11:32:42.188007Z] 11:32:42     INFO -   File "/home/worker/workspace/build/tests/firefox-ui/tests/puppeteer/test_toolbars.py", line 133, in test_load_url
[task 2016-09-16T11:32:42.189470Z] 11:32:42     INFO -     Wait(self.marionette).until(lambda mn: mn.get_url() == data_uri)
[task 2016-09-16T11:32:42.190936Z] 11:32:42     INFO -   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
[task 2016-09-16T11:32:42.191908Z] 11:32:42     INFO -     self.gen.throw(type, value, traceback)
[task 2016-09-16T11:32:42.193930Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1359, in using_context
[task 2016-09-16T11:32:42.196095Z] 11:32:42     INFO -     self.set_context(scope)
[task 2016-09-16T11:32:42.197425Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1337, in set_context
[task 2016-09-16T11:32:42.198901Z] 11:32:42     INFO -     self._send_message("setContext", {"value": context})
[task 2016-09-16T11:32:42.200096Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 55, in _
[task 2016-09-16T11:32:42.201889Z] 11:32:42     INFO -     m.force_shutdown()
[task 2016-09-16T11:32:42.205262Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 42, in _
[task 2016-09-16T11:32:42.206901Z] 11:32:42     INFO -     return func(*args, **kwargs)
[task 2016-09-16T11:32:42.207784Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 682, in _send_message
[task 2016-09-16T11:32:42.209007Z] 11:32:42     INFO -     msg = self.client.request(name, params)
[task 2016-09-16T11:32:42.209919Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 274, in request
[task 2016-09-16T11:32:42.211358Z] 11:32:42     INFO -     self.send(cmd)
[task 2016-09-16T11:32:42.212357Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 236, in send
[task 2016-09-16T11:32:42.213340Z] 11:32:42     INFO -     self.connect()
[task 2016-09-16T11:32:42.214352Z] 11:32:42     INFO -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 214, in connect
[task 2016-09-16T11:32:42.215307Z] 11:32:42     INFO -     self.sock.connect((self.addr, self.port))
[task 2016-09-16T11:32:42.216360Z] 11:32:42     INFO -   File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2016-09-16T11:32:42.217297Z] 11:32:42     INFO -     return getattr(self._sock,name)(*args)
[task 2016-09-16T11:32:42.219215Z] 11:32:42
Blocks: 1301093
Keywords: regression
Flags: needinfo?(mak77)
It seems to be e10s only.
Summary: Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused) → [e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
The difference is that before we were supposed to wait for the awesomebar results before handling Enter, but we were not doing that.
Now we should be constantly waiting, so if you type something and press Enter, we will wait for the first result to come before we handle the Enter action.
Most of the tests were expecting a content load, so it didn't require big changes to tests.

Maybe you can tell me more about the test behavior and actions?
Flags: needinfo?(mak77)
Summary: [e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused) → Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
Summary: Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused) → [e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
What concerns me here is that Firefox now starts to quit somehow. Not sure what exit code -15 exactly means, but maybe its a sigterm it could be related to a content process crash? It definitely started with your patch being landed and is our top failure now.

The test itself is pretty simple:

https://dxr.mozilla.org/mozilla-central/rev/f0f15b7c6aa77a0c5750918aa0a1cb3dc82185bc/testing/firefox-ui/tests/puppeteer/test_toolbars.py#128
https://dxr.mozilla.org/mozilla-central/rev/f0f15b7c6aa77a0c5750918aa0a1cb3dc82185bc/testing/puppeteer/firefox/firefox_puppeteer/ui/browser/toolbars.py#220

We focus the locationbar, enter a data URI, and press enter immediately.
Blocks: e10s-tests
tracking-e10s: --- → +
Hi ::mak,
Per comment #7, can you help shed some light here?
Flags: needinfo?(mak77)
(In reply to Henrik Skupin (:whimboo) from comment #7)
> What concerns me here is that Firefox now starts to quit somehow. Not sure
> what exit code -15 exactly means, but maybe its a sigterm it could be
> related to a content process crash? 

Is the "Connection refused" error coming from marionette and meaning it lost the connection to Firefox? I was wondering if the whole thing was caused by an outside connection that the harness disallows, and iirc we bailout abruptly if we try to reach the outside worls... but if it's just marionette indicating it lost the connection to FF, it's unlikely.

I don't have a lot of ideas, as I said the only difference is that before we were basically not opening the awesomebar popup and not waiting for any results, while now we may be waiting for at least one result (maybe more are incoming) and we open the popup briefly. I didn't see any reports in the wild about crashes from the awesomebar so doesn't look like it's a known crash or similar.
May the popup confuse your code? Do we have evidence Firefox crashed, like a desktop screenshot? is there a way to get a stack of that crash?
Flags: needinfo?(mak77)
Sadly this crash only happens for opt, and pgo builds. Very rarely also for ASAN builds. But all those build types do not really give helpful information via the logs. As long as this is not reproducible I have no idea how to track this down.

Exit code -15 at least means SIGTERM, so something killed Firefox. It's clearly not Marionette client because then we do not log the exit code. Reasons could be different why Firefox gets closed.

So far I haven't seen this behavior for our tests run against nightly builds for both central and aurora. Maybe it is related to the Taskcluster docker workers which run Ubuntu 16.04 while our machines are on Ubuntu 14.04. We switched from 12.04 to 16.04 via bug 1290192 on August 3rd. But there is still more than a months until this failure has been started.
it should be possible to generate optimized build with debug symbols (ac_add_options --enable-optimize ac_add_options --enable-debug-symbols), I wonder if that may help with getting a stack from the harness?
If Firefox gets killed from the OS or another process we do not generate minidumps. Also those would not be that helpful either given that Firefox abruptly stops working at some random code. 

The occurrence of this failure is still so low that it might take a while even when using a one click loaner on Taskcluster.
I can't tell how critical this issue is to fix in 51, and we're about to go to Beta.  The SIGTERM is a bit worrisome, but do we have an idea if this is somehow test only, and wouldn't show up in "real world"?
Flags: needinfo?(mak77)
OS: Unspecified → Linux
Yesterday I landed the improved crash handling for Marionette via bug 1299216. So what we would need is another occurrence of this abort. If it is a crash we should see details, if not it might be a system triggered abort of Firefox.
I don't have an answer cause I have absolutely no idea what causes this crash without a stack.
As an evidence, I didn't see crash bugs posted recently regarding the awesomebar and I didn't notice anything suspicious on crash-stats. I think crash-stats is our best bet at this point to tell if something is worrisome.
I'm leaving the ni open, to see if the new crash handling can tell us more.
I also saw 19 total failure from 10/24 to 11/21.
s/also/only
Henrik, did we get any additional info from the improved crash handling fix?
Flags: needinfo?(mak77) → needinfo?(hskupin)
Hm, the fix landed on Nov 8th, but OrangeFactor doesn't list any failures anymore. The last one is from Oct 30th. Maybe failures aren't getting starred correctly, or something fixed this issue. I will have a quick scan through Treeherder and figure out.
Flags: needinfo?(hskupin)
Actually I missed to update the target date when querying OF. So we still have a couple of test failures:

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1303346&startday=2016-10-24&endday=2016-11-21&tree=all

Right now I do not see any benefit from this improved crash handling. So it seems like Firefox get killed by the OS and breakpad doesn't catch it. 

While checking the logs on Treeherder I noticed that we miss the gecko.log file those days. I filed bug 1319028 to bring it back. I'm sure we will get some more details there.
Depends on: 1319028
So there was a failure today:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&fromchange=5ac0fb38b65024441bcc38bf899ea686c283e798&selectedJob=7181196

Log file:
https://public-artifacts.taskcluster.net/Y68Rnwp_TBS5MJlRRo0KJw/0/public/logs/live_backing.log

Failing lines:

> [task 2016-11-24T09:51:41.381944Z] 09:51:41     INFO -  1479981101376	Marionette	TRACE	conn50 -> [0,155,"setContext",{"value":"content"}]
> [task 2016-11-24T09:51:41.382614Z] 09:51:41     INFO -  1479981101377	Marionette	TRACE	conn50 <- [1,155,null,{}]
> [task 2016-11-24T09:51:41.536896Z] 09:51:41     INFO -  1479981101532	Marionette	TRACE	conn50 -> [0,156,"getCurrentUrl",null]
> [task 2016-11-24T09:52:41.459588Z] 09:52:41     INFO -  1479981161455	Marionette	DEBUG	Closed connection conn50

So the client sends a message to get the current URL. But there is no response for 60s. Usually we kill Firefox ourselves but that should result in a different failure message. So I wonder if something strange is going on with the connection and that Windows killed Firefox.
I actually found a case where I can reproduce this failure. It is handled by bug 1305659. Maybe a fix for that bug could also solve our failure here.
Depends on: 1305659
I'm fairly sure now that the underlying issue for this problem is bug 1322277. The requested page for this test is not served correctly and usually we would run into a page load timeout in Marionette. But given that the socket timeout is shorter, Marionette kills Firefox too early.
Depends on: 1322277
No longer depends on: 1305659, 1319028
Priority: -- → P2
In triage meeting: we don't think this is serious enough for blocking 51, and this is kinda late for 51. So, change the status to fixoptional.
This failure hasn't been seen for a while. In detail the last failure happened on 12 Dec 2016, 15:17. Also the depending bug 1322277 just got fixed. I think we should mark the bug as WFM and reopen if it is happening again.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
we enabled search suggestions by default. I suspect that's related.
You want to set browser.urlbar.userMadeSearchSuggestionsChoice to true and browser.urlbar.suggest.searches to false.

provided these tests are not respecting prefs_general.js
Marco, is there any relationship with bug 1303781? I ask because bug 1257508 started to fail at the same time and is causes by the same Fx ui test.
Flags: needinfo?(mak77)
(In reply to Henrik Skupin (:whimboo) from comment #35)
> Marco, is there any relationship with bug 1303781? I ask because bug 1257508
> started to fail at the same time and is causes by the same Fx ui test.

Comment 34 is something you may want to do regardless (maybe in a separate bug?), because otherwise any typing in the locationbar will start a remote connection, that we don't want.

bug 1303781 could also be related, and maybe it's more likely to be the culprit (I keep forgetting that "[Errno 111] Connection refused" is not about remote connections but about marionette connection), since it changes the way the location bar buffers keyboard actions like DOWN, TAB, ENTER.

So here the test is typing a url and pressing Enter and that times out. Though doing the same manually in the browser clearly doesn't timeout. So there must be some difference in how it synthesize things.
The alternative is there may be a condition where we don't replay the delayed keys, but it won't be trivial to find out.
Flags: needinfo?(mak77)
as a side question, maybe you should make these test take a screenshot of the browser when they timeout, like mochitest-browser, at least here we could verify that the text has been properly typed in the locationbar.
See Also: → 1257508
(In reply to Marco Bonardo [::mak] from comment #37)
> as a side question, maybe you should make these test take a screenshot of
> the browser when they timeout, like mochitest-browser, at least here we
> could verify that the text has been properly typed in the locationbar.

We usually take screenshots, but here it doesn't work because the browser is somehow gone. So there is nothing to screenshot.

Anyway I had a look at the logs which are now much better than before. A lot of improvements have been added recently. So here the current issue:

[task 2017-05-25T23:44:45.817328Z] 23:44:45     INFO -  1495755885813	Marionette	TRACE	29 -> [0,149,"sendKeysToElement",{"text":"data:text/html,<title>Title</title>","id":"803319f9-f17b-4e40-add5-9b4fc12a8f74"}]
[task 2017-05-25T23:44:46.064663Z] 23:44:46     INFO -  1495755886063	Marionette	TRACE	29 <- [1,149,null,{}]
[task 2017-05-25T23:44:46.153077Z] 23:44:46     INFO -  1495755886150	Marionette	TRACE	29 -> [0,150,"getContext",{}]
[task 2017-05-25T23:44:46.154198Z] 23:44:46     INFO -  1495755886152	Marionette	TRACE	29 <- [1,150,null,{"value":"chrome"}]
[task 2017-05-25T23:44:46.163142Z] 23:44:46     INFO -  1495755886161	Marionette	TRACE	29 -> [0,151,"setContext",{"value":"content"}]
[task 2017-05-25T23:44:46.165804Z] 23:44:46     INFO -  1495755886163	Marionette	TRACE	29 <- [1,151,null,{}]
[task 2017-05-25T23:44:46.201124Z] 23:44:46     INFO -  1495755886195	Marionette	TRACE	29 -> [0,152,"getCurrentUrl",{}]
[task 2017-05-25T23:44:46.237937Z] 23:44:46     INFO -  1495755886230	Marionette	TRACE	29 <- [1,152,null,{"value":"about:"}]
[task 2017-05-25T23:44:46.399875Z] 23:44:46     INFO -  1495755886394	Marionette	TRACE	29 -> [0,153,"getCurrentUrl",{}]
[task 2017-05-25T23:44:46.438103Z] 23:44:46     INFO -  ++DOCSHELL 0xe7dd9c00 == 1 [pid = 1024] [id = {8050bfe7-bff5-4150-a2ef-e87e3175799e}]
[task 2017-05-25T23:44:46.534616Z] 23:44:46     INFO -  ++DOMWINDOW == 1 (0xe7ddd000) [pid = 1024] [serial = 1] [outer = (nil)]
[task 2017-05-25T23:44:46.837284Z] 23:44:46     INFO -  ++DOMWINDOW == 2 (0xebf61800) [pid = 1024] [serial = 2] [outer = 0xe7ddd000]
[task 2017-05-25T23:44:46.877471Z] 23:44:46     INFO -  1495755886873	Marionette	DEBUG	loaded listener.js

`sendKeysToElement` is used to send the data to the location bar, and which triggers a page load. Then when we call `getCurrentUrl` right afterward a remotness change comes into the way which causes Marionette to hang because the now unloaded frame script cannot send the reply for `getCurrentUrl` back to the chrome process.

Andreas, I don't know why we have to ask the listener for the current URL. I would suggest to change that and simply get the value from the chrome process which also knows about it. Are you in favor of that? This should also improve the timing a lot given that the call to the content process can take a couple of seconds on slow test machines and debug builds.
Flags: needinfo?(ato)
(In reply to Marco Bonardo [::mak] from comment #34)
> we enabled search suggestions by default. I suspect that's related.
> You want to set browser.urlbar.userMadeSearchSuggestionsChoice to true and
> browser.urlbar.suggest.searches to false.

I filed bug 1368034 to get this fixed. Thanks.
(In reply to Henrik Skupin (:whimboo) from comment #39)
> Andreas, I don't know why we have to ask the listener for the current URL. I
> would suggest to change that and simply get the value from the chrome
> process which also knows about it. Are you in favor of that? This should
> also improve the timing a lot given that the call to the content process can
> take a couple of seconds on slow test machines and debug builds.

Yes, I have never understood why chrome context has to call the listener script to get the current URL.  Will you make the change?
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #41)
> Yes, I have never understood why chrome context has to call the listener
> script to get the current URL.  Will you make the change?

Great. I will do it and just filed bug 1368101.
Depends on: 1368101
This should be fixed with my patch on bug 1368101 which already landed on autoland. I will observe the next days.
Since my aforementioned patch got landed I haven't seen this failure on autoland anymore. Marking as fixed.
Assignee: nobody → hskupin
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [stockwell fixed]
In practice, we're not hitting this on ESR52. Let's just wontfix and get it off the radar.
You need to log in before you can comment on or make changes to this bug.