Bug 1331198 (Closed) - Opened 7 years ago - Closed 7 years ago

Intermittent Marionette startup IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)

Categories

(Testing :: Marionette Client and Harness, defect)

Version 3
defect
Not set
normal

Tracking

(firefox52 wontfix, firefox-esr52 fixed, firefox53 fixed, firefox54 fixed, firefox55 fixed)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [fixed by bug 1345105][stockwell fixed])

Not sure what prevented us from connecting to the server because it clearly has been started:

> 1484384659275	Marionette	DEBUG	Marionette enabled via build flag and pref
> [825] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2639
> ++DOCSHELL 0xe1fc8800 == 1 [pid = 825] [id = {cf7ccef0-5217-45d4-ac35-3bc0929e6273}]
> ++DOMWINDOW == 1 (0xe1fcb000) [pid = 825] [serial = 1] [outer = (nil)]
> ++DOMWINDOW == 2 (0xe22e2c00) [pid = 825] [serial = 2] [outer = 0xe1fcb000]
> 1484384660211	Marionette	INFO	Listening on port 2828
> 1484384660920	Marionette	DEBUG	Marionette enabled via command-line flag
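
For reference, a quick way to double-check from a loaner whether anything actually accepts connections on the Marionette port could look like the following (assuming nc and ss are available in the container; the port is the one from the failure message):

  # does anything accept TCP connections on the Marionette port?
  nc -z -v localhost 2828
  # list listening TCP sockets (ss ships with iproute2 on Ubuntu 16.04)
  ss -ltn | grep 2828
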
For all those failures I see the following messages in the log:

[task 2017-03-01T02:20:58.989002Z] + screen -XS xvfb quit
[task 2017-03-01T02:20:58.989891Z] XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
[task 2017-03-01T02:20:58.990396Z]       after 1108 requests (1108 known processed) with 1 events remaining.
[task 2017-03-01T02:20:58.990515Z] compizconfig - Info: Backend     : ini
[task 2017-03-01T02:20:58.990589Z] compizconfig - Info: Integration : true
[task 2017-03-01T02:20:58.991938Z] compizconfig - Info: Profile     : default
[task 2017-03-01T02:20:59.090609Z] No screen session found.

Does it mean that xvfb doesn't work correctly?
Flags: needinfo?(dustin)
Hm, the above is a red herring. I actually see the same for passing jobs.
Flags: needinfo?(dustin)
So I can see this failure when running a one-click loaner. Starting Firefox takes a very long time; the UI only shows up after about 1 minute! First I thought that something is simply wrong with the DISPLAY setting.

But even when it is up, it behaves very sluggishly. Opening the browser console only displays the content but no window borders or decorations. I would assume that we might still have an issue with the virtual framebuffer. I will do more checks outside of Mozharness now to see how this performs.
I can clearly reproduce this problem each time now for a fresh one-click loaner with my patch from bug 1322383 landed on autoland. The first execution of the mozharness script always ends up in a failure. When I start the wizard again and run the job, all is fine.

Btw, here is a link to the recent failing jobs:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=d080b72ec241db12e79f1da4b08e2ed6870ab7c4&filter-searchStr=mn

Dustin, do you see any indication that something could be wrong with the desktop manager or framebuffer?
I can even see this slow startup of Firefox when I run the following steps right after the loaner has been made available:

1. Request a one-click loaner of such a failing linux32 debug job
2. Run step 2 of the run-wizard
3. Run workspace/build/application/firefox/firefox -console

In step 3 you will notice a delay of more than 1 minute until the first messages from the Gecko log are printed to the console.

After waiting a few more minutes and running the steps again, the slow startup no longer occurs.
Flags: needinfo?(dustin)
Flags: needinfo?(garndt)
Here is the complete history from Orange Factor:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1331198&startday=2017-01-01&endday=2017-03-01&tree=all

As you can see, this is happening way more often for e10s builds of Firefox than for non-e10s. It started on January 17th when Chris' patch from bug 1330496 landed. But that patch was actually only about Windows, so I'm not sure if it could cause a problem on Linux 32.

I did a backout of all three commits in a local branch and pushed it to try, but it's still failing for me:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0afcc7c2a083461c73683ebedafbd053001500e9

Given that my commit for the GFX sanity window handling (https://hg.mozilla.org/integration/autoland/rev/d080b72ec241) made the failure more prominent, I could check whether applying only this commit to older changesets in mozilla-central lets me figure out a regression range, if there is one for Firefox.
Flags: needinfo?(cpeterson)
(In reply to Henrik Skupin (:whimboo) from comment #12)
> As you can see this is happening way more often for e10s builds of Firefox
> than non-e10s. It started on January 17th with Chris' patch from bug 1330496
> being landed. But that was actually only about Windows. So not sure if that
> could cause a problem on Linux 32.

My change in bug 1330496 should not have affected Linux. It mostly modified XP/Vista-only code paths in some Windows-only files. It did touch some mochitest Python scripts, but just some simple changes that set a Windows-only environment variable:

https://hg.mozilla.org/mozilla-central/rev/5515c1da32d9
Flags: needinfo?(cpeterson)
(In reply to Henrik Skupin (:whimboo) from comment #7)
> For all those failures I see the following messages in the log:
> 
> [task 2017-03-01T02:20:58.989002Z] + screen -XS xvfb quit
> [task 2017-03-01T02:20:58.989891Z] XIO:  fatal IO error 11 (Resource
> temporarily unavailable) on X server ":0"
> [task 2017-03-01T02:20:58.990396Z]       after 1108 requests (1108 known
> processed) with 1 events remaining.
> [task 2017-03-01T02:20:58.990515Z] compizconfig - Info: Backend     : ini
> [task 2017-03-01T02:20:58.990589Z] compizconfig - Info: Integration : true
> [task 2017-03-01T02:20:58.991938Z] compizconfig - Info: Profile     : default
> [task 2017-03-01T02:20:59.090609Z] No screen session found.

This is killing xvfb during shutdown - expected.

Based on the slowdowns, I wonder if this is running out of memory.  Can you see if that's the case using the loaner?  If not, can you see what firefox is doing for that 1 minute? Perhaps with 'strace' or detailed logging?

If Firefox doesn't have any window chrome, that suggests a problem with the window manager (compiz).  That could also be due to out-of-memory errors (if compiz is OOM'd).  Alternatively, maybe compiz has some errors.

I see warnings in https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/test_info//gecko.log - are those normal?  

I see warnings in https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/test/public/xsession-errors.log too - those are probably normal, but maybe there's an issue with dbus here?
Flags: needinfo?(dustin)
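
For the out-of-memory question, a minimal sketch of what one could run on the loaner while Firefox is starting (assuming the usual procps tools are present in the container; the grep patterns are just assumptions):

  # overall memory and swap usage in MB
  free -m
  # any OOM killer activity in the kernel ring buffer?
  dmesg | grep -i -E 'out of memory|oom|killed process'
  # one-shot snapshot of the biggest memory consumers
  ps -eo pid,pmem,rss,comm --sort=-rss | head -n 10
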
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> Based on the slowdowns, I wonder if this is running out of memory.  Can you
> see if that's the case using the loaner?  If not, can you see what firefox
> is doing for that 1 minute? Perhaps with 'strace' or detailed logging?

Yes, it is visible all the time when running the loaner based on a job run as part of the commit given in comment 10. Besides the missing shell output when this is happening, I also do not see any Firefox window coming up via VNC. I never used strace, so do you have some proposals what to do exactly?

> If Firefox doesn't have any window chrome, that suggests a problem with the
> window manager (compiz).  That could also be due to out-of-memory errors (if
> copmiz is OOM'd).  Alternately, maybe compiz has some errors.

Given that after about 2min everything is back to normal, does Compiz fix itself?

> I see warnings in
> https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/
> test_info//gecko.log - are those normal?  

All of the output you can see here is from after the time when Firefox was stalled during startup. Within at least the first minute I do not get any Gecko output at all, and that is reflected in this log file too. What we see here is the time until the Marionette harness bails out due to the startup timeout after 120s.

> I see warnings in
> https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/
> test/public/xsession-errors.log too - those are probably normal, but maybe
> there's an issue with dbus here?

When I check a log file of a passing job the same failures seem to appear there too:
https://public-artifacts.taskcluster.net/KiufMrqmTbSHoiRmd_4s6g/0/public/test/public/xsession-errors.log

All in all I'm not familiar with anything xsession related, so help would be appreciated.

It's interesting that this only happens for the Linux 32-bit docker images but not the 64-bit ones. As far as I remember, Joel did most of the work for this. Maybe he can remember what might be different in terms of the window manager between both docker images.
Flags: needinfo?(jmaher)
Flags: needinfo?(dustin)
I don't have many more ideas on how to help.

Strace is pretty easy:

  strace -p <firefox pid>
Flags: needinfo?(dustin)
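
For reference, a minimal sketch of how such a trace could be captured (assuming strace is installed and ptrace is permitted inside the container; the process name pattern is an assumption):

  # find the Firefox process id (the match pattern is an assumption)
  pgrep -f application/firefox/firefox
  # attach to it, follow forks, add timestamps, and write the trace to a file
  strace -f -tt -o /tmp/firefox-startup.strace -p <firefox pid>
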
I don't see anything scary in the logs, but this is an interesting problem.

A few things:
1) is this 12.04 or 16.04?
2) could we verify that a small delay here prior to starting the test could help?
3) I assume there is something wrong with the startup/config; possibly dbus has issues and we need to wait for all the services to start up without connecting to dbus (or they wait for it to start and are then available).  Possibly firefox depends on these services.

Looking at dmesg on startup would be good. What if we reboot the docker container, can we reproduce this effect?
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #17)
> A few things:
> 1) is this 12.04 or 16.04?

This is all about: Ubuntu 16.04.1 LTS docker images.

> 2) could we verify that a small delay here prior to starting the test could
> help?

I don't think so. It could still cause races and I would be more in favor of a real fix. Keep in mind that each TC job first fetches mozharness, the binary, and the tests, and then unzips those files before Firefox can be started. With that we already have some extra time built in between job startup and the first launch of Firefox.

> 3) I assume there is something wrong with the startup/config, possibly dbus
> has issues and we need to wait for all the services to startup without
> connecting to dbus (or they wait for it to start then they are  available). 
> Possibly firefox depends on these services.

Maybe. I will check whether I can reproduce this issue with just the single commit extracted from my larger patch on bug 1322383. If that is the case I will run a regression range via try in parallel.

> looking at dmesg would be good on startup- what if we reboot the docker
> container, can we reproduce this effect?

How do I reboot the container? Is it a simple `reboot`? Am I allowed to do so? I can try it out with a one-click loaner soon.
Flags: needinfo?(jmaher)
You can't reboot containers (they're not VMs).  And the docker hosts start up and are destroyed all the time.  If you got two different loaners, it's very unlikely they were on the same host.

I'm not sure how the kernel would be involved here - we're not using kdbus, as far as I know.  That said, you can see the kernel's log buffer (bearing in mind it may contain messages related to other containers) by running `dmesg` in a container.
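
A minimal sketch of the checks suggested above, run from inside the container (assuming dmesg is readable and a session bus is supposed to exist there):

  # last kernel messages (may include entries from other containers on the same host)
  dmesg | tail -n 50
  # is a dbus daemon running, and is a session bus address exported?
  ps -ef | grep [d]bus-daemon
  echo "$DBUS_SESSION_BUS_ADDRESS"
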
(In reply to Dustin J. Mitchell [:dustin] from comment #16)
> Strace is pretty easy:
> 
>   strace -p <firefox pid>

I cannot use strace because it always bails out with "Operation not permitted". I also cannot use sudo, nor do I know what to do with su.

I find bug 1287099 interesting: it landed January 10th on mozilla-central and enabled Ubuntu 16.04 docker images for tests, including Marionette. Joel was working on that. Maybe we indeed regressed something with this update and just didn't notice so far.
Ok, something strange I can see is that both the firefox and gnome-software processes are using nearly 100% CPU during that time. While Firefox's CPU usage goes back to ~11% after it has started up, gnome-software stays that high for about 10 minutes even if no tests are run! So what are we doing here and why does it take that long?

Btw I believe that this behavior also causes a massive slowdown of our tests especially for debug builds.
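
For reference, the CPU observation above can be reproduced with a one-shot snapshot like this (plain procps, nothing harness-specific; firefox and gnome-software show up near 100% CPU here):

  # busiest processes first
  ps -eo pid,ppid,pcpu,pmem,etime,comm --sort=-pcpu | head -n 10
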
It looks like we start gnome-software when I run the setup wizard for the Marionette tests. Does that mean something in mozharness triggers an additional installation of software?

After waiting for gnome-software to get back to a normal CPU load, the start of a Firefox debug build is way faster, and I can see the window within about 30s, compared to over 1min before.
The behavior as reported in the last comment was a red herring. gnome-software indeed gets started by something else.

# ps -ef | grep gnome-software
worker     251    33 96 22:28 pts/0    00:03:13 /usr/bin/gnome-software --gapplication-service
# ps -ef | grep 33
worker      33     1  0 22:28 pts/0    00:00:00 /usr/lib/gnome-session/gnome-session-binary
That's certainly something that could be associated with moving from 12.04 to 16.04.  Disabling that (perhaps just by uninstalling gnome-software?) is probably a good fix.
Flags: needinfo?(garndt)
I pushed a try build that moves the Marionette tests back to Ubuntu 12.04:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1744d365a1d35fbe780d0671f36a1f095065aff&selectedJob=82104725

As you can see it's working fine with that version of Ubuntu. So I blame the upgrade to Ubuntu 16.04 for this kind of regression.

With the builds we see the following:

[taskcluster 2017-03-07 08:43:07.369Z] Task ID: f9c36_S4Q8C7Qm16VC0VRw
[..]
[setup 2017-03-07T08:47:03.406840Z] run-task started
[..]
[task 2017-03-07T08:47:03.413879Z] executing ['/home/worker/bin/test-linux.sh', ...
[..]
[task 2017-03-07T08:47:23.034239Z] Running: python2.7 /home/worker/workspace/mozharness/scripts/marionette.py  ...
[..]
[task 2017-03-07T08:48:55.085890Z] 08:48:55     INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/h ...
[..]
[task 2017-03-07T08:49:07.796488Z] 08:49:07     INFO - SUITE-START | Running 87 tests

It means we start the job at 8:43:07 with the setup steps and start running the Mozharness Marionette script a bit more than 4 minutes later. It first sets up the virtualenv and installs the necessary tools. Then about 1:30min later we start the Marionette harness, which immediately launches Firefox and starts the tests 12s later!! This is 12s compared to more than 120s on Ubuntu 16.04!

I assume that the time we can save here by fixing this issue is even higher, given that gnome-software keeps running for a while on 16.04 while Marionette is running tests. Having both at 100% slows us down a lot.
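
As a sanity check of the numbers above, the deltas between the quoted log timestamps can be computed with GNU date (timestamps copied from the log excerpt):

  # task start (08:43:07) to mozharness call (08:47:23): 256s, ~4:16min
  echo $(( $(date -u -d '2017-03-07T08:47:23' +%s) - $(date -u -d '2017-03-07T08:43:07' +%s) ))
  # mozharness call to harness start (08:48:55): 92s, ~1:32min
  echo $(( $(date -u -d '2017-03-07T08:48:55' +%s) - $(date -u -d '2017-03-07T08:47:23' +%s) ))
  # harness start to SUITE-START (08:49:07): 12s
  echo $(( $(date -u -d '2017-03-07T08:49:07' +%s) - $(date -u -d '2017-03-07T08:48:55' +%s) ))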

I will take this bug for now given that it looks like a kinda important issue for all of our test suites running the Ubuntu 16.04 docker image. Not sure where to start but I hope that Joel might be able to help me.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
The switch over to Ubuntu 16.04 actually happened on bug 1330408.
Blocks: 1330408
Keywords: regression
I found a topic related to our issue:
http://askubuntu.com/questions/789385/software-and-updater-consumes-100-cpu-in-ubuntu-16-04

I want to verify that, but at the moment I'm not able to because I cannot request a one-click loaner due to a massive overload of the AWS providers. I will defer further work until the Taskcluster team has this fixed.
Thanks :whimboo for looking into this. I like the solution to solve the high CPU for the software updater; that seems simple and hopefully easy to verify.  I will stay tuned.
Flags: needinfo?(jmaher)
Starting the command manually I see the following failures and a hang which causes the 100% cpu load:

# /usr/bin/gnome-software --gapplication-service
(gnome-software:529): GLib-GObject-WARNING **: invalid (NULL) pointer instance
(gnome-software:529): GLib-GObject-CRITICAL **: g_signal_handlers_disconnect_matched: assertion 'G_TYPE_CHECK_INSTANCE (instance)' failed
(gnome-software:529): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
(gnome-software:529): Gs-WARNING **: Failed to create permission org.freedesktop.packagekit.trigger-offline-update: Error initializing authority: Could not connect: No such file or directory
(gnome-software:529): GLib-GObject-WARNING **: invalid (NULL) pointer instance
(gnome-software:529): GLib-GObject-CRITICAL **: g_signal_handlers_disconnect_matched: assertion 'G_TYPE_CHECK_INSTANCE (instance)' failed
(gnome-software:529): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
(gnome-software:529): Gs-WARNING **: Failed to create permission org.freedesktop.packagekit.trigger-offline-update: Error initializing authority: Could not connect: No such file or directory
(gnome-software:529): Gs-WARNING **: failed to call gs_plugin_add_updates_historical on fwupd: Could not connect: No such file or directory

Let's see what's necessary to get those failures fixed.
I think the best solution here is to get this service disabled. I will file a new bug to get this fixed.
Depends on: 1345105
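
For reference, in the docker image a change along these lines could disable the service. This is only a sketch of the idea, not the actual patch that landed on bug 1345105; the autostart path is an assumption based on the --gapplication-service command line seen above.

  # drop the package entirely ...
  apt-get remove -y gnome-software
  # ... or just keep gnome-session from autostarting it (path is an assumption)
  rm -f /etc/xdg/autostart/gnome-software-service.desktop
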
This should be fixed now by the landing of the patch on bug 1345105.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 1345105]
Whiteboard: [fixed by bug 1345105] → [fixed by bug 1345105][stockwell fixed]
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing