Closed Bug 1397201 Opened 4 years ago Closed 4 years ago

Intermittent mochitest, browser-chrome, devtools-chrome TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 330 or 370 seconds with no output

Categories

(Testing :: General, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:infra])

Attachments

(1 file)

This seems to be a timeout on startup:

[task 2017-09-08T07:21:57.947870Z] 07:21:57     INFO - TEST-INFO | started process GECKO(1285)
[task 2017-09-08T07:22:03.444551Z] 07:22:03     INFO - GECKO(1285) | 1504855323432	Marionette	INFO	Enabled via --marionette
[task 2017-09-08T07:22:10.537304Z] 07:22:10     INFO - GECKO(1285) | 1504855330531	Marionette	INFO	Listening on port 2828
[task 2017-09-08T07:27:40.548979Z] 07:27:40     INFO - Buffered messages finished
[task 2017-09-08T07:27:40.549535Z] 07:27:40    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 330 seconds with no output

Screenshots show Firefox running, new tab with no url, no activity.

Initial failures were Windows 10, but most failures are Linux now.

Ideas?
Flags: needinfo?(jmaher)
Whiteboard: [stockwell needswork]
odd, these are jobs that have been active on win10 for a while, not new jobs I added this week.

Possibly :grenade has updated the win10 vm image we use in the last 2 days?
Flags: needinfo?(jmaher) → needinfo?(rthijssen)
These failures did seem to start on Windows, so I am interested in :grenade's response, but there are a lot of Linux failures too.


Many have explicit marionette failures, like:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=129471154&lineNumber=1978

[task 2017-09-08T07:54:40.740931Z] 07:54:40    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 330 seconds with no output
[task 2017-09-08T07:54:40.741927Z] 07:54:40    ERROR - Force-terminating active process(es).
[task 2017-09-08T07:54:40.743462Z] 07:54:40     INFO - Determining child pids from psutil
[task 2017-09-08T07:54:40.751692Z] 07:54:40     INFO - Found child pids: [1277]
[task 2017-09-08T07:54:40.752529Z] 07:54:40     INFO - Killing process: 1277
[task 2017-09-08T07:54:40.753336Z] 07:54:40     INFO - TEST-INFO | started process screentopng
[task 2017-09-08T07:54:41.032381Z] 07:54:41     INFO - TEST-INFO | screentopng: exit 0
[task 2017-09-08T07:54:41.251411Z] 07:54:41     INFO - psutil found pid 1277 dead
[task 2017-09-08T07:54:41.251833Z] 07:54:41     INFO - Killing process: 1221
[task 2017-09-08T07:54:41.251948Z] 07:54:41     INFO - Not taking screenshot here: see the one that was previously logged
[task 2017-09-08T07:54:41.368927Z] 07:54:41     INFO -  Traceback (most recent call last):
[task 2017-09-08T07:54:41.370223Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2635, in doTests
[task 2017-09-08T07:54:41.374508Z] 07:54:41     INFO -      marionette_args=marionette_args,
[task 2017-09-08T07:54:41.378399Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2165, in runApp
[task 2017-09-08T07:54:41.380868Z] 07:54:41     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-09-08T07:54:41.382861Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-09-08T07:54:41.384733Z] 07:54:41     INFO -      m._handle_socket_failure()
[task 2017-09-08T07:54:41.386222Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-09-08T07:54:41.387102Z] 07:54:41     INFO -      return func(*args, **kwargs)
[task 2017-09-08T07:54:41.388949Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1222, in start_session
[task 2017-09-08T07:54:41.389511Z] 07:54:41     INFO -      resp = self._send_message("newSession", body)
[task 2017-09-08T07:54:41.390247Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-09-08T07:54:41.390819Z] 07:54:41     INFO -      m._handle_socket_failure()
[task 2017-09-08T07:54:41.391426Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-09-08T07:54:41.391956Z] 07:54:41     INFO -      return func(*args, **kwargs)
[task 2017-09-08T07:54:41.392665Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 731, in _send_message
[task 2017-09-08T07:54:41.393292Z] 07:54:41     INFO -      msg = self.client.request(name, params)
[task 2017-09-08T07:54:41.394240Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 284, in request
[task 2017-09-08T07:54:41.394787Z] 07:54:41     INFO -      return self.receive()
[task 2017-09-08T07:54:41.395457Z] 07:54:41     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 184, in receive
[task 2017-09-08T07:54:41.396770Z] 07:54:41     INFO -      raise socket.error("No data received over socket")
[task 2017-09-08T07:54:41.397446Z] 07:54:41     INFO -  error: No data received over socket
[task 2017-09-08T07:54:41.398218Z] 07:54:41    ERROR - Automation Error: Received unexpected exception while running application

Maybe :whimboo can offer insight?
Flags: needinfo?(hskupin)
Sounds somewhat similar to bug 1362293 where the client doesn't continue to try to connect. I'm currently working on some improvements which might help. See latest comments on bug 1362293.
Flags: needinfo?(hskupin)
the win 10 ami was indeed updated but only to change the way in which the ebs drive is mounted. it's difficult to see how it would affect only specific tests since if it had gone badly we shouldn't have a z: drive, which would break everything.
Flags: needinfo?(rthijssen)
the issue seems to have "resolved" itself, as there have been no instance in the last 1.5 days- maybe we wait a few days.
I requested a one click loaner of type t-w1064-ix so we can get this better investigated.
Rob mentioned on a different bug that in case of a failure there is only a white screen but no Firefox window present. Rob, is that the desktop background, or just an empty rect which would hold the Firefox window? I wonder if that is the same underlying issue for all the affected harnesses.
Flags: needinfo?(rthijssen)
it's not the desktop background, but it is the dimensions of the desktop (1280 x 1024) rather than the size of the browser window. i've not seen it before. it's odd that it's white. we had black screenshots when we had desktop issues before.
https://public-artifacts.taskcluster.net/IePb-o_dTW-saX_kuev6uA/0/public/test_info/mozilla-test-fail-screenshot_s9qv3o.png
Flags: needinfo?(rthijssen)
With the move of mochitest-gl tests from Buildbot to Taskcluster the number of failures has been drastically reduced!
Depends on: 1403484
Sorry but I don't have the time to further check what's up with this failure.
So lately the Marionette server gets initialized and started. But somehow the first command is not getting executed. So maybe there is some suspicious behavior on the client side which doesn't allow us to continue. Bug 1394381 would help here so that we would see at least some more logging output. I can try to prioritize that work a bit.
Component: General Automation → General
Depends on: 1394381
Product: Release Engineering → Firefox
QA Contact: catlee
Duplicate of this bug: 1399712
Duplicate of this bug: 1404866
Duplicate of this bug: 1404879
Duplicate of this bug: 1404878
4 new bugs, total of about 150 failures in the last week.
Lately the failures on bug 1407390 have been increased a lot too. Given that the other bug also happens on Windows 10 only and is directly under Marionette, I might have a better chance to get this investigated. Marking it as blocker for now.
Depends on: 1407390
Joel, to get closer to the problem I would like to enable trace logging for Marionette. It turned out to be already pretty useful for bug 1410495. Given that all of the mochitests only use Marionette for installing an add-on, the additional log output would be minimal. But in case of failures it would help us a lot to investigate issues during startup.

If you are ok with that I kinda would like to make this a permanent change.
Flags: needinfo?(jmaher)
this sounds like a good idea.
Flags: needinfo?(jmaher)
Ok, I filed bug 1411913 to get the trace logging for Marionette enabled.
Depends on: 1411913
Product: Firefox → Testing
With my patch on bug 1411913 landed I see the following:

00:26:27     INFO -  GECKO(6172) | 1509063987888	Marionette	DEBUG	Received observer notification "profile-after-change"
00:26:28     INFO -  GECKO(6172) | 1509063988010	Marionette	DEBUG	Received observer notification "command-line-startup"
00:26:28     INFO -  GECKO(6172) | 1509063988010	Marionette	INFO	Enabled via --marionette
00:26:29     INFO -  GECKO(6172) | 1509063989060	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"
00:26:29     INFO -  GECKO(6172) | 1509063989483	Marionette	DEBUG	New connections are accepted
00:26:29     INFO -  GECKO(6172) | 1509063989483	Marionette	INFO	Listening on port 2828

So we correctly initialize and start the socket listener. But then not a single socket connection attempt from marionette client is received. I wonder if something is broken in the socket server implementation, which sometimes doesn't let us connect if Firefox starts up really quickly. It also reminds me on a situation we had on bug 1410366 comment 32, where no connection can be made because of a hang in `start_session`. No data is getting received by the client, even the connection has been established.

What I had to make sure is that always `wait_for_port()` is being called, which tries to setup a connection until the first data actually is getting send. That's not what `start_session()` is always doing. At least it does it when Marionette doesn't handle the connection itself. But then we actually do not take care of the result:

https://dxr.mozilla.org/mozilla-central/rev/aa958b29c149a67fce772f8473e9586e71fbdb46/testing/marionette/client/marionette_driver/marionette.py#1225

That means instead of `wait_for_port()` we should better call `raise_for_port()` which automatically raises an IOError if no connection can be made. Or even better, move out this logic from `start_session` completely, and require the client to call it itself. I will raise a new bug for that so we can get this discussed and fixed.
:whimboo- what is the current status here?  I see the marionette port wait is fixed, it would be good to have a recent update in the bug for others to see the status.
Flags: needinfo?(hskupin)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #56)
> :whimboo- what is the current status here?  I see the marionette port wait
> is fixed, it would be good to have a recent update in the bug for others to
> see the status.

I was out since last Friday and just came back. So I wasn't able to give an update since my patch on bug 1412222 landed.

So if I'm not completely wrong in analyzing the numbers from OF, the failure count dropped a lot compared to mid of last week. But not sure what that means, given that it goes up and down a lot.

By inspecting various logs I can see that the server component of Marionette runs, and also registers the necessary framescripts for each tab. But the Marionette client somewhat hangs, and cannot connect. So I think that we need the Marionette client logging setup, which I started to work on via bug 1394381, and which is already referenced in the dependency list. I think that this is what we would need first to figure out issues with Marionette client.

Also in some cases I see the networking panel shown on the right side, which may interfere with our jobs:
https://treeherder.mozilla.org/logviewer.html#?repo=try&job_id=141217917&lineNumber=1579
https://public-artifacts.taskcluster.net/JzQ7JKmKQ4eq-X5AxbcYcw/0/public/test_info/mozilla-test-fail-screenshot_zzu0wd.png

Rob, could you have a look into that, so that it doesn't popup anymore?
Flags: needinfo?(hskupin) → needinfo?(rthijssen)
i created a pr to prevent the popup from appearing
https://github.com/mozilla-releng/OpenCloudConfig/pull/101

i've emailed firefox-ci requesting an exception to the change freeze rules in order to merge this pr.
Flags: needinfo?(rthijssen)
As I just found out there is a bug in the Marionette client code which extraordinary extends the socket timeout for `start_session` from 2s to 360s. As such we can see a hang in that method, if no hello string from Marionette server is getting send. I will get this fixed via bug 1413852.

With that in place we might no longer see this specific failure, because a timeout would happen with 2s. If that is the case then there is clearly a socket communication problem present.
Beside the last mentioned issue with marionette which is clearly the reason for a output timeout during startup, it could still be a problem when sending commands later, like we do for installing addons. This is because the default socket timeout is set to 360s.

I had a look at mozharness but there is no output timeout of 330s or 300s defined. So I wondered where this value exactly comes from. The only reference I was able to find is here:

https://dxr.mozilla.org/mozilla-central/rev/cd7217cf05a2332a8fd7b498767a07b2c31ea657/testing/mochitest/runtests.py#2655

But this should only be set for the js harness and nothing else, right? But as it looks like this value is set for all kinds of mochitests. As result the automation will bail out because the output timeout is shorter than the Marionette socket timeout of 360s.

This is most likely also the reason why we mainly only see this failure for opt/pgo builds. Debug builds always print out some lines on and off so the socket timeout of Marionette is hit earlier.

Geoff, can you figure out if we can just bump up this value to >360? Or we adjust it in case of Marionette is used a couple of lines below? 

I strongly believe that when bumping up this value, and my patch on bug 413852 landed, we will better see the underlying issue, which we then have to track in a different bug because the signature will change.
Flags: needinfo?(gbrown)
Great analysis :whimboo! I will look into the mochitest timeout today.
Increasing that timeout seems like a good idea -- bug 1414063.
Flags: needinfo?(gbrown)
As it looks like there are still no changes even with my patch on bug 1413852 landed. :( It means the fix on bug 1414063 should help us to at least see the real failure.
is it possible to test our automation without marionette in the picture?  Maybe that will solve the problem, or lead us to reduce the problems we see to a specific area.
If those addons can be installed without Marionette, sure. Someone would have to check that, and that it also works for release builds. Personally I doubt that, given that we added the dependency for Marioentte.
the patch requested in comment 58 should now be live. please let me know if any further popups appear.
https://github.com/mozilla-releng/OpenCloudConfig/commit/0c299dd5ef6445a6627c7599c80818c5c5d8a128
Talos installs addons without signing or marionette, we should be able to do the same for mochitest/reftest.  The hiccup I see is release builds- I am not sure about mozilla-release and esr, there might be different signing requirements there, although I believe mozilla-beta enforces everything.
Yes, beta and release enforce signing of extensions. Only temporary installed addons should work. This cannot be done by setting up the profile like for Talos before Firefox gets started.
Looks like the bump to 370s was maybe not enough? The signature only changed to "application timed out after 370 seconds with no output" now.

Maybe I should try to just push a try build including print debug lines for mochitest runtests.py and marionette client all over the place. I will do that on Monday.
We're going to take our starring over to bug 1414495 now, since automation.py is in treeherder's "filename" blocklist so it's a full-line search for this failure, so we need 330 in the summary for this one for failures below trunk, and 370 in the summary for that one for failures on trunk.
Blocks: 1414495
I see 43 failures on november 4th for windows10-debug, on what looks to be about 35 revisions- this goes with my theory that we see on average 1 failure/revision (and cut in half for debug).

Why I mention this is I have been experimenting with getting rid of marionette and testing on try:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=cc8ca406e04dc7ae4dc3680adb3422a4fba071dd

I found the failure rate so high (sadly test view doesn't help me here), it was very slow to find the failures I wanted in treeherder, so I...disabled tests- I have 22 skip-if's (2 of them are on the entire manifest accounting for 18 tests)- so in total I am disabling 38 tests and now I am seeing easier to understand failures:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=73f83ef918e5a260f31909a8ff8f96e33d76ec15

I am pushing with --rebuild 20, this means that it is similar to 20 pushes that run all the tests- in addition, I did this twice for each config, so it is similar to me doing --rebuild 40 for the 2 scenarios:
1) default -u mochitests --rebuild 40
2) nomarionette -u mochitests --rebuild 40

^ these do not test reftests, but mochitest accounts for 80%+ of the failures we see

What is odd is:
1) without marionette the cl jobs leak often in dom/events/
2) I see 2 instances of application timed out total- one per push of default

Think about this, I am effectively running 40 pushes and I see 2 instances of the failure (.05/push) which is showing up on debug at a rate of .5/push.  That is 10 times the failure rate on inbound/autoland.  When I had little to no tests disabled early on, I had a higher rate of automation.py timeouts.

Why can I not reproduce on try?  Possibly this is because:
1) we are running on a weekend and patterns are different?  (I think not as we had 93 failures yesterday)
2) try is different than autoland/inbound- actually I was doing artifact builds
3) we don't understand the problem
4) our list of tests are doing something funky with data/state of the browser during startup.

I am doing some pushes with --no-artifact to see if #2 is the problem- assuming I get similar results, I might theorize that disabling many tests would be beneficial and we should probably investigate why that would be the case.
I did 40x each test with --no-artifact and both with[out] marionette no automation.py timeouts- so now I am pushing with no tests skipped to see how many I get.
The timeouts have been changed. But lets keep the work happening on this bug and not bug 1414495.

(In reply to Geoff Brown [:gbrown] from comment #1)
> I added some logging in
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=6737b2be84e9b7cd23fcb1dcf4586abab865498a and
> reproduced the timeouts. In the timeout cases, it looks to me like
> start_session() completes very quickly -- maybe even before the server has
> started?? I wonder if it is actually connecting to something (what?) or if
> it is just failing but not noticing that it's failing.
> 
> I then added more logging and pushed to try, but then couldn't reproduce any
> time outs on startup -- just as jmaher reported in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1397201#c79.

18:46:18     INFO -  creating Marionette...
18:46:18     INFO -  starting session...
18:46:18     INFO -  2017-11-03 18:46:18.668000 start_session: starts
18:46:18     INFO -  2017-11-03 18:46:18.673000 start_session: after raise_for_port
18:46:18     INFO -  2017-11-03 18:46:18.675000 start_session: after connect
18:46:18     INFO -  2017-11-03 18:46:18.675000 start_session: calls send_message...
18:46:18     INFO -  2017-11-03 18:46:18.698000 start_session: after send_message
18:46:18     INFO -  started session
[..]
18:46:21     INFO -  GECKO(1896) | 1509734781926	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"
18:46:22     INFO -  GECKO(1896) | 1509734782546	Marionette	DEBUG	New connections are accepted
18:46:22     INFO -  GECKO(1896) | 1509734782546	Marionette	INFO	Listening on port 2828

Basically Marionette client should not be able to connect to the server socket unless it is actively listening on port 2828, which in this case happened 4s after the connection attempt. So why did that pass? It could only mean that there was another Firefox instance running occupying this port. But it must have been released quickly, because otherwise the socket server of the wanted Firefox instance would have been failed.
Summary: Intermittent mochitest, browser-chrome, devtools-chrome TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 330 seconds with no output → Intermittent mochitest, browser-chrome, devtools-chrome TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
I am not able to reproduce this on try anymore- I pushed without my patch to disable 38 tests and the tests failed as expected, but this specific issue did not show up.

While I see it still failing on integration branches, I am at a loss for how to reproduce this on try- especially given the failure rate of 530+/week.  Looking at the failures that are annotated to bug 1414495, they are almost all opt/pgo, only a couple debug failures (3 out of 24)- whereas before we were looking at about 50% being debug related.
Summary: Intermittent mochitest, browser-chrome, devtools-chrome TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output → Intermittent mochitest, browser-chrome, devtools-chrome TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 330 or 370 seconds with no output
most of the recent failures are opt/pgo and fail more often on clipboard || reftest related jobs- these are jobs that run on buildbot vs taskcluster.  We still get failures on taskcluster, but it seems to be a smaller percentage.  There is a patch in progress which will allow us to get mochitest-clipboard for windows10 on AWS (bug 1394757), that might help reduce the pain a bit, but only slightly.
I hit this problem now for a Marionette harness job. See bug 1400819 comment 14 for details.

Given that an unknown instance of Firefox seems to be around, I wonder if a previous job didn't cleanup correctly. As such I used the worker and queried for previous jobs:

https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64/workers/us-east-1/i-030c576e2ffcff094

> failed	     test-windows10-64/debug-marionette-headless-e10s   	        ORMYsomeRzmPaatiNzjQvQ	2 hours ago 	15 minutes ago
> completed  test-windows10-64/opt-cppunit   	                        I3CF5UQXSR6o4gML-xpNQQ	2 hours ago 	2 hours ago
> failed	     test-windows10-64/debug-firefox-ui-functional-local-e10s   	MWuhOGp9QiGNEE1ZVVO50A	6 hours ago 	4 hours ago
> failed	     test-windows10-64/debug-mochitest-media-e10s   	        FmsYFg8tSlelQ60ao4POpQ	7 hours ago 	6 hours ago

So the mochitest media job failed, which itself is using Marionette to install necessary temporarily extensions. Then every other test job based on Marionette fails with the output timeout of 1000s, or maybe in this case 330/370s.

Here the log of the mochitest-media job:

https://public-artifacts.taskcluster.net/FmsYFg8tSlelQ60ao4POpQ/0/public/logs/live_backing.log

At the end I can see lots of network issues with the socket thread. I also assume that due to this reason no cleanup of this Windows 10 worker has been done. Given that we currently make use of the same user for all jobs, this can definitely cause problems for the next job to be running on that worker.

Is there a way to say we reboot a worker in case a former test job failed, until bug 1399401 has been fixed and deployed?
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
this worker type is not configured to format and reboot after each task. we could implement that (it's a simple change in tc provisoner to the worker type config) if there is some issue with generic-worker not cleaning up after the task. i don't think there's much incentive to patch generic-worker at the old 8.3 version running on this worker type since the effort to upgrade to 10.x is underway. i'll let pete weigh in but i think the simplest fix is just to format and reboot after each task until the generic-worker upgrade lands.
Flags: needinfo?(rthijssen)
A solution for that would be great. But thinking more about it, a change like that should only help in those cases when the failure happens for the very first start of Firefox in that job (like bug 1391545). 

I checked a couple of recent failures from OF and noticed that for mochitest the failure happened after x number of starts of Firefox. So I would assume that we might also need a fix for mochitest/reftest/mozrunner/mozprocess to wait for the former process to be completely gone. It sounds similar to bug 1407203, or even bug 1410883.

Joel and Geoff, could you please have a look into those harnesses, and how they handle the Firefox process?

Maybe it might be good to add some code to both harnesses by using psutil, and to log remaining Firefox processes when none should be around. I don't have the time to look into this, sorry.
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Indeed the machine i-030c576e2ffcff094 is continuing to spit out timeouts for marionette for other tests as executed. I asked Greg to disable the machine which has been done at the tme I write this comment.

Rob, can you please log into that machine and check its state? It looks like this would be kinda helpful for us for continued investigation!
Flags: needinfo?(rthijssen)
Attached image processes.png
there does appear to be quite a few firefox processes running...
Flags: needinfo?(rthijssen)
should we fail if there are existing firefox processes running (we should check for release/beta/nightly names used)?

I agree we should ensure the mozprocess (which manages many of our harnesses)- gbrown, do you have more ideas?
Flags: needinfo?(jmaher)
See Also: → 1415245
See Also: → 1415247
Thanks Henrik. I filed a couple of bugs to follow-up on those ideas and will try to find time to pursue them.
Flags: needinfo?(gbrown)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #91)
> should we fail if there are existing firefox processes running (we should
> check for release/beta/nightly names used)?

I think if the processes cannot be closed while the tests are running, we certainly should fail! It indicates a problem during shutdown for Firefox. This can have different reasons:

* A content process cannot be killed even not by the watchdog thread (bug 1395504)
* Content processes are still alive after an exit of Firefox (bug 1396574) and might cause a Firefox zombie process left around
* A slow running script which freezes Firefox during shutdown (bug 1391594)

Those are just a couple of situation I have in mind right now. There might be others / more in the future.

When Marionette handles the instance we clearly raise an exception in case the process does not correctly shutdown. But that is not done when Marionette is used from mochitest and reftest. Here the harness is only used to install addons and not more. The relevant code in Marionette for socket failures can be found here:

https://dxr.mozilla.org/mozilla-central/rev/2535bad09d720e71a982f3f70dd6925f66ab8ec7/testing/marionette/client/marionette_driver/marionette.py#817-843

Geoff, I think we have to take those things into account which will make the harness to better report on those situations.
See Also: → 1415290
i've gone ahead and added:

        "numberOfTasksToRun": 1,

to the worker type configuration for gecko-t-win10-64.

this will force new instances to format the z: drive and reboot between tasks.
Here a quick update from today's observations:

Due to the left-over instances of Firefox Marionette connects to one of those, before the instance as started by mochitests has been completed starting. This is because Marionette always uses port 2828 right now and does not check for a free port (bug 1240830). As such the newSession command is getting sent to this other instance, which might be broken in some ways. I assume that because Marionette server never returns from that command, but we also do not see the logs in what could be broken.

I'm not able to reproduce that locally, which makes it actually hard to get investigated.
Recent test logs now have a system-info.log artifact, which includes a process list taken just before the test harness (like runtests.py) is started. *Some* recent failures, like https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=143119664&lineNumber=1895 / https://public-artifacts.taskcluster.net/Cek62BLURGWh0S1mNMU5VA/0/public/test_info/system-info.log, show multiple firefox.exe processes and multiple desktop_unittest.py instances.

I thought we started rebooting before every task -- is that not fully deployed yet?
Although this bug has more than enough failures for "disable-recommended" status, there's no single test to disable here. Failures are generally on Windows 10; we could make those tasks tier 3, but that seems extreme, so taking no action for now.
:garndt, what is required to reboot a windows AMI between every test?  maybe we do this on failure status codes to save some cycles?
Flags: needinfo?(garndt)
I'm almost certain that they currently reboot between tests, but I might be mistaken.  Pete/Rob, do you know off hand?
Flags: needinfo?(garndt) → needinfo?(rthijssen)
You're going to see a significant decrease in frequency, which means absolutely nothing about anything being fixed or improved. The only thing it means is that it was pointed out to me that taskcluster now has a working button for Disable which I have permission to press, so when I star one of these I also look at whether the instance has failed most of what it has done recently, and disable it if it has.
all test instances for win 7 and 10, gpu and non are configured to reboot between tasks since comment 95 above.

however, there were several hundred instances running this morning that booted before the configuration change 2 days ago. those instances were running on older configuration so i terminated everything created before the config change just now. going forward, we should only see instances that reboot between every task.

note that there is no setting in generic worker to only reboot after failed tasks. it's only possible to set the number of tasks to run between reboots.
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
(In reply to Rob Thijssen (:grenade - UTC+2) from comment #102)
> all test instances for win 7 and 10, gpu and non are configured to reboot
> between tasks since comment 95 above.
> 
> however, there were several hundred instances running this morning that
> booted before the configuration change 2 days ago. those instances were
> running on older configuration so i terminated everything created before the
> config change just now. going forward, we should only see instances that
> reboot between every task.
> 
> note that there is no setting in generic worker to only reboot after failed
> tasks. it's only possible to set the number of tasks to run between reboots.

There's been a massive improvement to the failure rate here over the last few days, and I no longer see old tasks in system-info.log -- thanks!
Great to hear! I had a look at OF and since the reboot got enabled for all workers, there was only a single report logged:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=143980332&lineNumber=4597

I cannot see Firefox running before the mochitests started, so maybe that is something else.

Now I wonder how we can proceed in figuring out which job/test actually causes Firefox instances to lingering around.
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:infra]
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.