Closed
Bug 1303346
Opened 8 years ago
Closed 8 years ago
[e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
Categories
(Testing :: Firefox UI Tests, defect, P2)
Tracking
(e10s+, firefox50 unaffected, firefox51 wontfix, firefox52 wontfix, firefox-esr52 wontfix, firefox53 wontfix, firefox54 wontfix, firefox55 fixed)
RESOLVED
FIXED
mozilla55
People
(Reporter: intermittent-bug-filer, Assigned: whimboo)
References
(Blocks 1 open bug)
Details
(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed])
Assignee | ||
Comment 1•8 years ago
|
||
With the patch on bug 1301093 landed on autoland I see a new test failure popping up quite often now:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=85ec67dd11e5626a4cf0efc2abda9e3d44cb63b2&filter-searchStr=fxfn&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable
[task 2016-09-16T11:32:42.182647Z] 11:32:42 INFO - TEST-UNEXPECTED-ERROR | test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
[task 2016-09-16T11:32:42.185021Z] 11:32:42 INFO - Traceback (most recent call last):
[task 2016-09-16T11:32:42.185657Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test/testcases.py", line 162, in run
[task 2016-09-16T11:32:42.186575Z] 11:32:42 INFO - testMethod()
[task 2016-09-16T11:32:42.188007Z] 11:32:42 INFO - File "/home/worker/workspace/build/tests/firefox-ui/tests/puppeteer/test_toolbars.py", line 133, in test_load_url
[task 2016-09-16T11:32:42.189470Z] 11:32:42 INFO - Wait(self.marionette).until(lambda mn: mn.get_url() == data_uri)
[task 2016-09-16T11:32:42.190936Z] 11:32:42 INFO - File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
[task 2016-09-16T11:32:42.191908Z] 11:32:42 INFO - self.gen.throw(type, value, traceback)
[task 2016-09-16T11:32:42.193930Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1359, in using_context
[task 2016-09-16T11:32:42.196095Z] 11:32:42 INFO - self.set_context(scope)
[task 2016-09-16T11:32:42.197425Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1337, in set_context
[task 2016-09-16T11:32:42.198901Z] 11:32:42 INFO - self._send_message("setContext", {"value": context})
[task 2016-09-16T11:32:42.200096Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 55, in _
[task 2016-09-16T11:32:42.201889Z] 11:32:42 INFO - m.force_shutdown()
[task 2016-09-16T11:32:42.205262Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 42, in _
[task 2016-09-16T11:32:42.206901Z] 11:32:42 INFO - return func(*args, **kwargs)
[task 2016-09-16T11:32:42.207784Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 682, in _send_message
[task 2016-09-16T11:32:42.209007Z] 11:32:42 INFO - msg = self.client.request(name, params)
[task 2016-09-16T11:32:42.209919Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 274, in request
[task 2016-09-16T11:32:42.211358Z] 11:32:42 INFO - self.send(cmd)
[task 2016-09-16T11:32:42.212357Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 236, in send
[task 2016-09-16T11:32:42.213340Z] 11:32:42 INFO - self.connect()
[task 2016-09-16T11:32:42.214352Z] 11:32:42 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 214, in connect
[task 2016-09-16T11:32:42.215307Z] 11:32:42 INFO - self.sock.connect((self.addr, self.port))
[task 2016-09-16T11:32:42.216360Z] 11:32:42 INFO - File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2016-09-16T11:32:42.217297Z] 11:32:42 INFO - return getattr(self._sock,name)(*args)
[task 2016-09-16T11:32:42.219215Z] 11:32:42
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(mak77)
Assignee | ||
Comment 2•8 years ago
|
||
It seems to be e10s only.
Summary: Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused) → [e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
Comment 3•8 years ago
|
||
The difference is that before we were supposed to wait for the awesomebar results before handling Enter, but we were not doing that.
Now we should be constantly waiting, so if you type something and press Enter, we will wait for the first result to come before we handle the Enter action.
Most of the tests were expecting a content load, so it didn't require big changes to tests.
Maybe you can tell me more about the test behavior and actions?
Flags: needinfo?(mak77)
Summary: [e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused) → Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
Updated•8 years ago
|
Summary: Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused) → [e10s] Intermittent test_toolbars.py TestLocationBar.test_load_url | IOError: Process has been closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 7•8 years ago
|
||
What concerns me here is that Firefox now starts to quit somehow. Not sure what exit code -15 exactly means, but maybe its a sigterm it could be related to a content process crash? It definitely started with your patch being landed and is our top failure now.
The test itself is pretty simple:
https://dxr.mozilla.org/mozilla-central/rev/f0f15b7c6aa77a0c5750918aa0a1cb3dc82185bc/testing/firefox-ui/tests/puppeteer/test_toolbars.py#128
https://dxr.mozilla.org/mozilla-central/rev/f0f15b7c6aa77a0c5750918aa0a1cb3dc82185bc/testing/puppeteer/firefox/firefox_puppeteer/ui/browser/toolbars.py#220
We focus the locationbar, enter a data URI, and press enter immediately.
Updated•8 years ago
|
Blocks: e10s-tests
tracking-e10s:
--- → +
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
status-firefox50:
--- → unaffected
status-firefox52:
--- → affected
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 12•8 years ago
|
||
Hi ::mak,
Per comment #7, can you help shed some light here?
Flags: needinfo?(mak77)
Comment 13•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #7)
> What concerns me here is that Firefox now starts to quit somehow. Not sure
> what exit code -15 exactly means, but maybe its a sigterm it could be
> related to a content process crash?
Is the "Connection refused" error coming from marionette and meaning it lost the connection to Firefox? I was wondering if the whole thing was caused by an outside connection that the harness disallows, and iirc we bailout abruptly if we try to reach the outside worls... but if it's just marionette indicating it lost the connection to FF, it's unlikely.
I don't have a lot of ideas, as I said the only difference is that before we were basically not opening the awesomebar popup and not waiting for any results, while now we may be waiting for at least one result (maybe more are incoming) and we open the popup briefly. I didn't see any reports in the wild about crashes from the awesomebar so doesn't look like it's a known crash or similar.
May the popup confuse your code? Do we have evidence Firefox crashed, like a desktop screenshot? is there a way to get a stack of that crash?
Flags: needinfo?(mak77)
Assignee | ||
Comment 14•8 years ago
|
||
Sadly this crash only happens for opt, and pgo builds. Very rarely also for ASAN builds. But all those build types do not really give helpful information via the logs. As long as this is not reproducible I have no idea how to track this down.
Exit code -15 at least means SIGTERM, so something killed Firefox. It's clearly not Marionette client because then we do not log the exit code. Reasons could be different why Firefox gets closed.
So far I haven't seen this behavior for our tests run against nightly builds for both central and aurora. Maybe it is related to the Taskcluster docker workers which run Ubuntu 16.04 while our machines are on Ubuntu 14.04. We switched from 12.04 to 16.04 via bug 1290192 on August 3rd. But there is still more than a months until this failure has been started.
Comment 15•8 years ago
|
||
it should be possible to generate optimized build with debug symbols (ac_add_options --enable-optimize ac_add_options --enable-debug-symbols), I wonder if that may help with getting a stack from the harness?
Assignee | ||
Comment 16•8 years ago
|
||
If Firefox gets killed from the OS or another process we do not generate minidumps. Also those would not be that helpful either given that Firefox abruptly stops working at some random code.
The occurrence of this failure is still so low that it might take a while even when using a one click loaner on Taskcluster.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
I can't tell how critical this issue is to fix in 51, and we're about to go to Beta. The SIGTERM is a bit worrisome, but do we have an idea if this is somehow test only, and wouldn't show up in "real world"?
Flags: needinfo?(mak77)
OS: Unspecified → Linux
Assignee | ||
Comment 20•8 years ago
|
||
Yesterday I landed the improved crash handling for Marionette via bug 1299216. So what we would need is another occurrence of this abort. If it is a crash we should see details, if not it might be a system triggered abort of Firefox.
Comment 21•8 years ago
|
||
I don't have an answer cause I have absolutely no idea what causes this crash without a stack.
As an evidence, I didn't see crash bugs posted recently regarding the awesomebar and I didn't notice anything suspicious on crash-stats. I think crash-stats is our best bet at this point to tell if something is worrisome.
I'm leaving the ni open, to see if the new crash handling can tell us more.
Comment 22•8 years ago
|
||
I also saw 19 total failure from 10/24 to 11/21.
Comment 23•8 years ago
|
||
s/also/only
Comment 24•8 years ago
|
||
Henrik, did we get any additional info from the improved crash handling fix?
Flags: needinfo?(mak77) → needinfo?(hskupin)
Assignee | ||
Comment 25•8 years ago
|
||
Hm, the fix landed on Nov 8th, but OrangeFactor doesn't list any failures anymore. The last one is from Oct 30th. Maybe failures aren't getting starred correctly, or something fixed this issue. I will have a quick scan through Treeherder and figure out.
Flags: needinfo?(hskupin)
Assignee | ||
Comment 26•8 years ago
|
||
Actually I missed to update the target date when querying OF. So we still have a couple of test failures:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1303346&startday=2016-10-24&endday=2016-11-21&tree=all
Right now I do not see any benefit from this improved crash handling. So it seems like Firefox get killed by the OS and breakpad doesn't catch it.
While checking the logs on Treeherder I noticed that we miss the gecko.log file those days. I filed bug 1319028 to bring it back. I'm sure we will get some more details there.
Depends on: 1319028
Assignee | ||
Comment 27•8 years ago
|
||
So there was a failure today:
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&fromchange=5ac0fb38b65024441bcc38bf899ea686c283e798&selectedJob=7181196
Log file:
https://public-artifacts.taskcluster.net/Y68Rnwp_TBS5MJlRRo0KJw/0/public/logs/live_backing.log
Failing lines:
> [task 2016-11-24T09:51:41.381944Z] 09:51:41 INFO - 1479981101376 Marionette TRACE conn50 -> [0,155,"setContext",{"value":"content"}]
> [task 2016-11-24T09:51:41.382614Z] 09:51:41 INFO - 1479981101377 Marionette TRACE conn50 <- [1,155,null,{}]
> [task 2016-11-24T09:51:41.536896Z] 09:51:41 INFO - 1479981101532 Marionette TRACE conn50 -> [0,156,"getCurrentUrl",null]
> [task 2016-11-24T09:52:41.459588Z] 09:52:41 INFO - 1479981161455 Marionette DEBUG Closed connection conn50
So the client sends a message to get the current URL. But there is no response for 60s. Usually we kill Firefox ourselves but that should result in a different failure message. So I wonder if something strange is going on with the connection and that Windows killed Firefox.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 29•8 years ago
|
||
I actually found a case where I can reproduce this failure. It is handled by bug 1305659. Maybe a fix for that bug could also solve our failure here.
Depends on: 1305659
Assignee | ||
Comment 30•8 years ago
|
||
I'm fairly sure now that the underlying issue for this problem is bug 1322277. The requested page for this test is not served correctly and usually we would run into a page load timeout in Marionette. But given that the socket timeout is shorter, Marionette kills Firefox too early.
Updated•8 years ago
|
status-firefox53:
--- → affected
Priority: -- → P2
Comment 31•8 years ago
|
||
In triage meeting: we don't think this is serious enough for blocking 51, and this is kinda late for 51. So, change the status to fixoptional.
Updated•8 years ago
|
Assignee | ||
Comment 32•8 years ago
|
||
This failure hasn't been seen for a while. In detail the last failure happened on 12 Dec 2016, 15:17. Also the depending bug 1322277 just got fixed. I think we should mark the bug as WFM and reopen if it is happening again.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Assignee | ||
Comment 33•8 years ago
|
||
Comment 34•8 years ago
|
||
we enabled search suggestions by default. I suspect that's related.
You want to set browser.urlbar.userMadeSearchSuggestionsChoice to true and browser.urlbar.suggest.searches to false.
provided these tests are not respecting prefs_general.js
Assignee | ||
Comment 35•8 years ago
|
||
Marco, is there any relationship with bug 1303781? I ask because bug 1257508 started to fail at the same time and is causes by the same Fx ui test.
Flags: needinfo?(mak77)
Comment 36•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #35)
> Marco, is there any relationship with bug 1303781? I ask because bug 1257508
> started to fail at the same time and is causes by the same Fx ui test.
Comment 34 is something you may want to do regardless (maybe in a separate bug?), because otherwise any typing in the locationbar will start a remote connection, that we don't want.
bug 1303781 could also be related, and maybe it's more likely to be the culprit (I keep forgetting that "[Errno 111] Connection refused" is not about remote connections but about marionette connection), since it changes the way the location bar buffers keyboard actions like DOWN, TAB, ENTER.
So here the test is typing a url and pressing Enter and that times out. Though doing the same manually in the browser clearly doesn't timeout. So there must be some difference in how it synthesize things.
The alternative is there may be a condition where we don't replay the delayed keys, but it won't be trivial to find out.
Flags: needinfo?(mak77)
Comment 37•8 years ago
|
||
as a side question, maybe you should make these test take a screenshot of the browser when they timeout, like mochitest-browser, at least here we could verify that the text has been properly typed in the locationbar.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 39•8 years ago
|
||
(In reply to Marco Bonardo [::mak] from comment #37)
> as a side question, maybe you should make these test take a screenshot of
> the browser when they timeout, like mochitest-browser, at least here we
> could verify that the text has been properly typed in the locationbar.
We usually take screenshots, but here it doesn't work because the browser is somehow gone. So there is nothing to screenshot.
Anyway I had a look at the logs which are now much better than before. A lot of improvements have been added recently. So here the current issue:
[task 2017-05-25T23:44:45.817328Z] 23:44:45 INFO - 1495755885813 Marionette TRACE 29 -> [0,149,"sendKeysToElement",{"text":"data:text/html,<title>Title</title>","id":"803319f9-f17b-4e40-add5-9b4fc12a8f74"}]
[task 2017-05-25T23:44:46.064663Z] 23:44:46 INFO - 1495755886063 Marionette TRACE 29 <- [1,149,null,{}]
[task 2017-05-25T23:44:46.153077Z] 23:44:46 INFO - 1495755886150 Marionette TRACE 29 -> [0,150,"getContext",{}]
[task 2017-05-25T23:44:46.154198Z] 23:44:46 INFO - 1495755886152 Marionette TRACE 29 <- [1,150,null,{"value":"chrome"}]
[task 2017-05-25T23:44:46.163142Z] 23:44:46 INFO - 1495755886161 Marionette TRACE 29 -> [0,151,"setContext",{"value":"content"}]
[task 2017-05-25T23:44:46.165804Z] 23:44:46 INFO - 1495755886163 Marionette TRACE 29 <- [1,151,null,{}]
[task 2017-05-25T23:44:46.201124Z] 23:44:46 INFO - 1495755886195 Marionette TRACE 29 -> [0,152,"getCurrentUrl",{}]
[task 2017-05-25T23:44:46.237937Z] 23:44:46 INFO - 1495755886230 Marionette TRACE 29 <- [1,152,null,{"value":"about:"}]
[task 2017-05-25T23:44:46.399875Z] 23:44:46 INFO - 1495755886394 Marionette TRACE 29 -> [0,153,"getCurrentUrl",{}]
[task 2017-05-25T23:44:46.438103Z] 23:44:46 INFO - ++DOCSHELL 0xe7dd9c00 == 1 [pid = 1024] [id = {8050bfe7-bff5-4150-a2ef-e87e3175799e}]
[task 2017-05-25T23:44:46.534616Z] 23:44:46 INFO - ++DOMWINDOW == 1 (0xe7ddd000) [pid = 1024] [serial = 1] [outer = (nil)]
[task 2017-05-25T23:44:46.837284Z] 23:44:46 INFO - ++DOMWINDOW == 2 (0xebf61800) [pid = 1024] [serial = 2] [outer = 0xe7ddd000]
[task 2017-05-25T23:44:46.877471Z] 23:44:46 INFO - 1495755886873 Marionette DEBUG loaded listener.js
`sendKeysToElement` is used to send the data to the location bar, and which triggers a page load. Then when we call `getCurrentUrl` right afterward a remotness change comes into the way which causes Marionette to hang because the now unloaded frame script cannot send the reply for `getCurrentUrl` back to the chrome process.
Andreas, I don't know why we have to ask the listener for the current URL. I would suggest to change that and simply get the value from the chrome process which also knows about it. Are you in favor of that? This should also improve the timing a lot given that the call to the content process can take a couple of seconds on slow test machines and debug builds.
Flags: needinfo?(ato)
Assignee | ||
Comment 40•8 years ago
|
||
(In reply to Marco Bonardo [::mak] from comment #34)
> we enabled search suggestions by default. I suspect that's related.
> You want to set browser.urlbar.userMadeSearchSuggestionsChoice to true and
> browser.urlbar.suggest.searches to false.
I filed bug 1368034 to get this fixed. Thanks.
Comment 41•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #39)
> Andreas, I don't know why we have to ask the listener for the current URL. I
> would suggest to change that and simply get the value from the chrome
> process which also knows about it. Are you in favor of that? This should
> also improve the timing a lot given that the call to the content process can
> take a couple of seconds on slow test machines and debug builds.
Yes, I have never understood why chrome context has to call the listener script to get the current URL. Will you make the change?
Flags: needinfo?(ato)
Assignee | ||
Comment 42•8 years ago
|
||
(In reply to Andreas Tolfsen ‹:ato› from comment #41)
> Yes, I have never understood why chrome context has to call the listener
> script to get the current URL. Will you make the change?
Great. I will do it and just filed bug 1368101.
Depends on: 1368101
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 45•8 years ago
|
||
This should be fixed with my patch on bug 1368101 which already landed on autoland. I will observe the next days.
Assignee | ||
Comment 46•8 years ago
|
||
Since my aforementioned patch got landed I haven't seen this failure on autoland anymore. Marking as fixed.
Assignee: nobody → hskupin
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Updated•8 years ago
|
Whiteboard: [stockwell fixed]
Updated•8 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Comment 48•8 years ago
|
||
In practice, we're not hitting this on ESR52. Let's just wontfix and get it off the radar.
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•