Bug 1331198 (Closed) - Opened 7 years ago - Closed 7 years ago

Intermittent Marionette startup IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)

Categories

(Testing :: Marionette Client and Harness, defect)

Version 3
defect
Not set
normal

Tracking

(firefox52 wontfix, firefox-esr52 fixed, firefox53 fixed, firefox54 fixed, firefox55 fixed)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [fixed by bug 1345105][stockwell fixed])

Not sure what prevented us from connecting to the server because it clearly has been started:

> 1484384659275	Marionette	DEBUG	Marionette enabled via build flag and pref
> [825] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2639
> ++DOCSHELL 0xe1fc8800 == 1 [pid = 825] [id = {cf7ccef0-5217-45d4-ac35-3bc0929e6273}]
> ++DOMWINDOW == 1 (0xe1fcb000) [pid = 825] [serial = 1] [outer = (nil)]
> ++DOMWINDOW == 2 (0xe22e2c00) [pid = 825] [serial = 2] [outer = 0xe1fcb000]
> 1484384660211	Marionette	INFO	Listening on port 2828
> 1484384660920	Marionette	DEBUG	Marionette enabled via command-line flag
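
For reference, a quick way to double-check from a loaner whether anything actually accepts connections on the Marionette port could look like the following (assuming nc and ss are available in the container; the port is the one from the failure message):

  # does anything accept TCP connections on the Marionette port?
  nc -z -v localhost 2828
  # list listening TCP sockets (ss ships with iproute2 on Ubuntu 16.04)
  ss -ltn | grep 2828
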
For all those failures I see the following messages in the log:

[task 2017-03-01T02:20:58.989002Z] + screen -XS xvfb quit
[task 2017-03-01T02:20:58.989891Z] XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
[task 2017-03-01T02:20:58.990396Z]       after 1108 requests (1108 known processed) with 1 events remaining.
[task 2017-03-01T02:20:58.990515Z] compizconfig - Info: Backend     : ini
[task 2017-03-01T02:20:58.990589Z] compizconfig - Info: Integration : true
[task 2017-03-01T02:20:58.991938Z] compizconfig - Info: Profile     : default
[task 2017-03-01T02:20:59.090609Z] No screen session found.

Does it mean that xvfb doesn't work correctly?
Flags: needinfo?(dustin)
Hm, the above is a red herring. I actually see the same for passing jobs.
Flags: needinfo?(dustin)
So I can see this failure when running a one-click loaner. Starting Firefox takes a very long time; the UI only shows up after about 1 minute! First I thought that something is simply wrong with the DISPLAY setting.

But even when it is up, it behaves very sluggishly. Opening the browser console only displays the content but no window borders or decorations. I would assume that we might still have an issue with the virtual framebuffer. I will do more checks outside of Mozharness now to see how this performs.
I can clearly reproduce this problem each time now for a fresh one-click loaner with my patch from bug 1322383 landed on autoland. The first execution of the mozharness script always ends up in a failure. When I start the wizard again and run the job, all is fine.

Btw, here is a link to the recent failing jobs:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=d080b72ec241db12e79f1da4b08e2ed6870ab7c4&filter-searchStr=mn

Dustin, do you see any indication that something could be wrong with the desktop manager or framebuffer?
I can even see this slow startup of Firefox when I run the following steps right after the loaner has been made available:

1. Request a one-click loaner of such a failing linux32 debug job
2. Run step 2 of the run-wizard
3. Run workspace/build/application/firefox/firefox -console

In step 3 you will notice a delay of more than 1 minute until the first messages from the Gecko log are printed to the console.

After waiting a few more minutes and running the steps again, the slow startup no longer occurs.
Flags: needinfo?(dustin)
Flags: needinfo?(garndt)
Here is the complete history from Orange Factor:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1331198&startday=2017-01-01&endday=2017-03-01&tree=all

As you can see, this is happening way more often for e10s builds of Firefox than for non-e10s. It started on January 17th when Chris' patch from bug 1330496 landed. But that patch was actually only about Windows, so I'm not sure if it could cause a problem on Linux 32.

I did a backout of all three commits in a local branch and pushed it to try, but it's still failing for me:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0afcc7c2a083461c73683ebedafbd053001500e9

Given that my commit for the GFX sanity window handling (https://hg.mozilla.org/integration/autoland/rev/d080b72ec241) made the failure more prominent, I could check whether applying only this commit to older changesets in mozilla-central lets me figure out a regression range, if there is one for Firefox.
Flags: needinfo?(cpeterson)
(In reply to Henrik Skupin (:whimboo) from comment #12)
> As you can see this is happening way more often for e10s builds of Firefox
> than non-e10s. It started on January 17th with Chris' patch from bug 1330496
> being landed. But that was actually only about Windows. So not sure if that
> could cause a problem on Linux 32.

My change in bug 1330496 should not have affected Linux. It mostly modified XP/Vista-only code paths in some Windows-only files. It did touch some mochitest Python scripts, but just some simple changes that set a Windows-only environment variable:

https://hg.mozilla.org/mozilla-central/rev/5515c1da32d9
Flags: needinfo?(cpeterson)
(In reply to Henrik Skupin (:whimboo) from comment #7)
> For all those failures I see the following messages in the log:
> 
> [task 2017-03-01T02:20:58.989002Z] + screen -XS xvfb quit
> [task 2017-03-01T02:20:58.989891Z] XIO:  fatal IO error 11 (Resource
> temporarily unavailable) on X server ":0"
> [task 2017-03-01T02:20:58.990396Z]       after 1108 requests (1108 known
> processed) with 1 events remaining.
> [task 2017-03-01T02:20:58.990515Z] compizconfig - Info: Backend     : ini
> [task 2017-03-01T02:20:58.990589Z] compizconfig - Info: Integration : true
> [task 2017-03-01T02:20:58.991938Z] compizconfig - Info: Profile     : default
> [task 2017-03-01T02:20:59.090609Z] No screen session found.

This is killing xvfb during shutdown - expected.

Based on the slowdowns, I wonder if this is running out of memory.  Can you see if that's the case using the loaner?  If not, can you see what firefox is doing for that 1 minute? Perhaps with 'strace' or detailed logging?

If Firefox doesn't have any window chrome, that suggests a problem with the window manager (compiz).  That could also be due to out-of-memory errors (if compiz is OOM'd).  Alternatively, maybe compiz has some errors.

I see warnings in https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/test_info//gecko.log - are those normal?  

I see warnings in https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/test/public/xsession-errors.log too - those are probably normal, but maybe there's an issue with dbus here?
Flags: needinfo?(dustin)
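
For the out-of-memory question, a minimal sketch of what one could run on the loaner while Firefox is starting (assuming the usual procps tools are present in the container; the grep patterns are just assumptions):

  # overall memory and swap usage in MB
  free -m
  # any OOM killer activity in the kernel ring buffer?
  dmesg | grep -i -E 'out of memory|oom|killed process'
  # one-shot snapshot of the biggest memory consumers
  ps -eo pid,pmem,rss,comm --sort=-rss | head -n 10
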
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> Based on the slowdowns, I wonder if this is running out of memory.  Can you
> see if that's the case using the loaner?  If not, can you see what firefox
> is doing for that 1 minute? Perhaps with 'strace' or detailed logging?

Yes, it is visible all the time when running the loaner based on a job run as part of the commit given in comment 10. Besides the missing shell output when this is happening, I also do not see any Firefox window coming up via VNC. I never used strace, so do you have some proposals what to do exactly?

> If Firefox doesn't have any window chrome, that suggests a problem with the
> window manager (compiz).  That could also be due to out-of-memory errors (if
> copmiz is OOM'd).  Alternately, maybe compiz has some errors.

Given that after about 2min everything is back to normal, does Compiz fix itself?

> I see warnings in
> https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/
> test_info//gecko.log - are those normal?  

All of the output you can see here is from after the time when Firefox was stalled during startup. Within at least the first minute I do not get any Gecko output at all, and that is reflected in this log file too. What we see here is the time until the Marionette harness bails out due to the startup timeout after 120s.

> I see warnings in
> https://public-artifacts.taskcluster.net/f4CS5i1HSEOb5FR8rowCnA/0/public/
> test/public/xsession-errors.log too - those are probably normal, but maybe
> there's an issue with dbus here?

When I check a log file of a passing job the same failures seem to appear there too:
https://public-artifacts.taskcluster.net/KiufMrqmTbSHoiRmd_4s6g/0/public/test/public/xsession-errors.log

All in all I'm not familiar with anything xsession related, so help would be appreciated.

It's interesting that this only happens for the Linux 32-bit docker images but not the 64-bit ones. As far as I remember, Joel did most of the work for this. Maybe he can remember what might be different in terms of the window manager between both docker images.
Flags: needinfo?(jmaher)
Flags: needinfo?(dustin)
I don't have many more ideas on how to help.

Strace is pretty easy:

  strace -p <firefox pid>
Flags: needinfo?(dustin)
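
For reference, a minimal sketch of how such a trace could be captured (assuming strace is installed and ptrace is permitted inside the container; the process name pattern is an assumption):

  # find the Firefox process id (the match pattern is an assumption)
  pgrep -f application/firefox/firefox
  # attach to it, follow forks, add timestamps, and write the trace to a file
  strace -f -tt -o /tmp/firefox-startup.strace -p <firefox pid>
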
I don't see anything scary in the logs, but this is an interesting problem.

A few things:
1) is this 12.04 or 16.04?
2) could we verify that a small delay here prior to starting the test could help?
3) I assume there is something wrong with the startup/config; possibly dbus has issues and we need to wait for all the services to start up without connecting to dbus (or they wait for it to start and are then available).  Possibly firefox depends on these services.

Looking at dmesg on startup would be good. What if we reboot the docker container, can we reproduce this effect?
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #17)
> A few things:
> 1) is this 12.04 or 16.04?

This is all about: Ubuntu 16.04.1 LTS docker images.

> 2) could we verify that a small delay here prior to starting the test could
> help?

I don't think so. It could still cause races and I would be more in favor of a real fix. Keep in mind that each TC job first fetches mozharness, the binary, and the tests, and then unzips those files before Firefox can be started. With that we already have some extra time built in between job startup and the first launch of Firefox.

> 3) I assume there is something wrong with the startup/config, possibly dbus
> has issues and we need to wait for all the services to startup without
> connecting to dbus (or they wait for it to start then they are  available). 
> Possibly firefox depends on these services.

Maybe. I will check whether I can reproduce this issue with just the single commit extracted from my larger patch on bug 1322383. If that is the case I will run a regression range via try in parallel.

> looking at dmesg would be good on startup- what if we reboot the docker
> container, can we reproduce this effect?

How do I reboot the container? Is it a simple `reboot`? Am I allowed to do so? I can try it out with a one-click loaner soon.
Flags: needinfo?(jmaher)
You can't reboot containers (they're not VMs).  And the docker hosts start up and are destroyed all the time.  If you got two different loaners, it's very unlikely they were on the same host.

I'm not sure how the kernel would be involved here - we're not using kdbus, as far as I know.  That said, you can see the kernel's log buffer (bearing in mind it may contain messages related to other containers) by running `dmesg` in a container.
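
A minimal sketch of the checks suggested above, run from inside the container (assuming dmesg is readable and a session bus is supposed to exist there):

  # last kernel messages (may include entries from other containers on the same host)
  dmesg | tail -n 50
  # is a dbus daemon running, and is a session bus address exported?
  ps -ef | grep [d]bus-daemon
  echo "$DBUS_SESSION_BUS_ADDRESS"
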
(In reply to Dustin J. Mitchell [:dustin] from comment #16)
> Strace is pretty easy:
> 
>   strace -p <firefox pid>

I cannot use strace because it always bails out with "Operation not permitted". I also cannot use sudo, nor do I know what to do with su.

I find bug 1287099 interesting: it landed January 10th on mozilla-central and enabled Ubuntu 16.04 docker images for tests, including Marionette. Joel was working on that. Maybe we indeed regressed something with this update and just didn't notice so far.
Ok, something strange I can see is that both the firefox and gnome-software processes are using nearly 100% CPU during that time. While Firefox's CPU usage goes back to ~11% after it has started up, gnome-software stays that high for about 10 minutes even if no tests are run! So what are we doing here and why does it take that long?

Btw I believe that this behavior also causes a massive slowdown of our tests especially for debug builds.
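
For reference, the CPU observation above can be reproduced with a one-shot snapshot like this (plain procps, nothing harness-specific; firefox and gnome-software show up near 100% CPU here):

  # busiest processes first
  ps -eo pid,ppid,pcpu,pmem,etime,comm --sort=-pcpu | head -n 10
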
It looks like we start gnome-software when I run the setup wizard for the Marionette tests. Does that mean something in mozharness triggers an additional installation of software?

After waiting for gnome-software to get back to a normal CPU load, the start of a Firefox debug build is way faster, and I can see the window within about 30s, compared to over 1min before.
The behavior as reported in the last comment was a red herring. gnome-software indeed gets started by something else.

# ps -ef | grep gnome-software
worker     251    33 96 22:28 pts/0    00:03:13 /usr/bin/gnome-software --gapplication-service
# ps -ef | grep 33
worker      33     1  0 22:28 pts/0    00:00:00 /usr/lib/gnome-session/gnome-session-binary
That's certainly something that could be associated with moving from 12.04 to 16.04.  Disabling that (perhaps just by uninstalling gnome-software?) is probably a good fix.
Flags: needinfo?(garndt)
I pushed a try build that moves the Marionette tests back to Ubuntu 12.04:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1744d365a1d35fbe780d0671f36a1f095065aff&selectedJob=82104725

As you can see it's working fine with that version of Ubuntu. So I blame the upgrade to Ubuntu 16.04 for this kind of regression.

With the builds we see the following:

[taskcluster 2017-03-07 08:43:07.369Z] Task ID: f9c36_S4Q8C7Qm16VC0VRw
[..]
[setup 2017-03-07T08:47:03.406840Z] run-task started
[..]
[task 2017-03-07T08:47:03.413879Z] executing ['/home/worker/bin/test-linux.sh', ...
[..]
[task 2017-03-07T08:47:23.034239Z] Running: python2.7 /home/worker/workspace/mozharness/scripts/marionette.py  ...
[..]
[task 2017-03-07T08:48:55.085890Z] 08:48:55     INFO - Calling ['/home/worker/workspace/build/venv/bin/python', '-u', '/h ...
[..]
[task 2017-03-07T08:49:07.796488Z] 08:49:07     INFO - SUITE-START | Running 87 tests

It means we start the job at 8:43:07 with the setup steps and start running the Mozharness Marionette script a bit more than 4 minutes later. It first sets up the virtualenv and installs the necessary tools. Then about 1:30min later we start the Marionette harness, which immediately launches Firefox and starts the tests 12s later!! This is 12s compared to more than 120s on Ubuntu 16.04!

I assume that the time we can save here by fixing this issue is even higher, given that gnome-software keeps running for a while on 16.04 while Marionette is running tests. Having both at 100% slows us down a lot.
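
As a sanity check of the numbers above, the deltas between the quoted log timestamps can be computed with GNU date (timestamps copied from the log excerpt):

  # task start (08:43:07) to mozharness call (08:47:23): 256s, ~4:16min
  echo $(( $(date -u -d '2017-03-07T08:47:23' +%s) - $(date -u -d '2017-03-07T08:43:07' +%s) ))
  # mozharness call to harness start (08:48:55): 92s, ~1:32min
  echo $(( $(date -u -d '2017-03-07T08:48:55' +%s) - $(date -u -d '2017-03-07T08:47:23' +%s) ))
  # harness start to SUITE-START (08:49:07): 12s
  echo $(( $(date -u -d '2017-03-07T08:49:07' +%s) - $(date -u -d '2017-03-07T08:48:55' +%s) ))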

I will take this bug for now given that it looks like a kinda important issue for all of our test suites running the Ubuntu 16.04 docker image. Not sure where to start but I hope that Joel might be able to help me.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
The switch over to Ubuntu 16.04 actually happened on bug 1330408.
Blocks: 1330408
Keywords: regression
I found a topic related to our issue:
http://askubuntu.com/questions/789385/software-and-updater-consumes-100-cpu-in-ubuntu-16-04

I want to verify that, but at the moment I'm not able to because I cannot request a one-click loaner due to a massive overload of the AWS providers. I will defer further work until the Taskcluster team has this fixed.
Thanks :whimboo for looking into this. I like the solution to solve the high CPU for the software updater; that seems simple and hopefully easy to verify.  I will stay tuned.
Flags: needinfo?(jmaher)
Starting the command manually I see the following failures and a hang which causes the 100% cpu load:

# /usr/bin/gnome-software --gapplication-service
(gnome-software:529): GLib-GObject-WARNING **: invalid (NULL) pointer instance
(gnome-software:529): GLib-GObject-CRITICAL **: g_signal_handlers_disconnect_matched: assertion 'G_TYPE_CHECK_INSTANCE (instance)' failed
(gnome-software:529): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
(gnome-software:529): Gs-WARNING **: Failed to create permission org.freedesktop.packagekit.trigger-offline-update: Error initializing authority: Could not connect: No such file or directory
(gnome-software:529): GLib-GObject-WARNING **: invalid (NULL) pointer instance
(gnome-software:529): GLib-GObject-CRITICAL **: g_signal_handlers_disconnect_matched: assertion 'G_TYPE_CHECK_INSTANCE (instance)' failed
(gnome-software:529): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
(gnome-software:529): Gs-WARNING **: Failed to create permission org.freedesktop.packagekit.trigger-offline-update: Error initializing authority: Could not connect: No such file or directory
(gnome-software:529): Gs-WARNING **: failed to call gs_plugin_add_updates_historical on fwupd: Could not connect: No such file or directory

Let's see what's necessary to get those failures fixed.
I think the best solution here is to get this service disabled. I will file a new bug to get this fixed.
Depends on: 1345105
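
For reference, in the docker image a change along these lines could disable the service. This is only a sketch of the idea, not the actual patch that landed on bug 1345105; the autostart path is an assumption based on the --gapplication-service command line seen above.

  # drop the package entirely ...
  apt-get remove -y gnome-software
  # ... or just keep gnome-session from autostarting it (path is an assumption)
  rm -f /etc/xdg/autostart/gnome-software-service.desktop
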
This should be fixed now by the landing of the patch on bug 1345105.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 1345105]
Whiteboard: [fixed by bug 1345105] → [fixed by bug 1345105][stockwell fixed]
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing