Closed Bug 1345904 Opened 3 years ago Closed 3 years ago

Frequent failures in windows 7 vm pgo tests with Automation Error: Received unexpected exception while running application

Categories: Core :: General, defect
Version: 54 Branch
Type: defect
Priority: Not set
Tracking


RESOLVED FIXED
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed
firefox55 + fixed

People

(Reporter: cbook, Unassigned)


like https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=windows%207%20vm%20pgo&fromchange=d9d91d6e389ef19bd48eb5153c9e32a122a1eecc&selectedJob=82019761



Browser-chrome mochitests seem to fail a lot but go green on retrigger; it seems this started at the last merge day.

From the logs:

https://treeherder.mozilla.org/logviewer.html#?job_id=82790773&repo=mozilla-aurora&lineNumber=2859



07:18:07     INFO - TEST-INFO | started process Main app process
07:20:08     INFO -  Traceback (most recent call last):
07:20:08     INFO -    File "c:\slave\test\build\tests\mochitest\runtests.py", line 2482, in doTests
07:20:08     INFO -      marionette_args=marionette_args,
07:20:08     INFO -    File "c:\slave\test\build\tests\mochitest\runtests.py", line 2086, in runApp
07:20:08     INFO -      self.marionette.start_session(timeout=port_timeout)
07:20:08     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
07:20:08     INFO -      m._handle_socket_failure()
07:20:08     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
07:20:08     INFO -      return func(*args, **kwargs)
07:20:08     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1284, in start_session
07:20:08     INFO -      self.protocol, _ = self.client.connect()
07:20:08     INFO -    File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
07:20:08     INFO -      self.sock.connect((self.addr, self.port))
07:20:08     INFO -    File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
07:20:08     INFO -      return getattr(self._sock,name)(*args)
07:20:08     INFO -  error: [Errno 10061] No connection could be made because the target machine actively refused it
07:20:08    ERROR - Automation Error: Received unexpected exception while running application
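The Winsock error 10061 (ECONNREFUSED) at the bottom of that traceback means the harness reached the TCP stack but nothing was listening on the Marionette port, i.e. Firefox never got far enough to open it. A minimal sketch of the kind of port polling involved (illustrative only; `wait_for_port` and its parameters are not the real harness code):

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, poll_interval=0.5):
    """Poll until something is listening on (host, port).

    [Errno 10061] / ECONNREFUSED means the TCP stack answered but no
    process was bound to the port -- i.e. the Marionette server inside
    Firefox never started listening before the timeout expired.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # Succeeds as soon as a listener has bound and is accepting.
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Connection refused (or timed out): keep polling.
            time.sleep(poll_interval)
    return False
```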
Have you tried bisecting this on Try with uplift simulation patches?
Flags: needinfo?(cbook)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> Have you tried bisecting this on Try with uplift simulation patches?

At first glance I would say this happened between February 15th and 27th, based on the simulations.
Flags: needinfo?(cbook)
Component: General Automation → General
Flags: needinfo?(dburns)
Product: Release Engineering → Core
QA Contact: catlee
Version: unspecified → 54 Branch
Bisect Builds are underway.

From a look back at the uplift simulations, https://mzl.la/2lVzN9I seems to be the regression range. Joel, can you take a look? Maybe something stands out to you as a likely cause (besides the bisect Try runs that are already underway).
Flags: needinfo?(jmaher)
I looked back at the simulation runs and discovered that this problem, while it looks like it only hits windows 7 vm pgo, is also present on windows 7 vm opt (which it seems we don't run by default on production trees), while the normal (non-vm) windows 7 opt runs don't show this failure at the same time. I guess if we ran e10s mochitests on non-vm machines for windows 7 too, we would see the same thing.

arr, markco, grenade: do you know of a fundamental difference between the windows 7 vm opt (and pgo) machines and the non-vm ones that could explain this?

01:24:25     INFO -      chunk = self.sock.recv(bytes_to_recv)
01:24:25     INFO -  error: [Errno 10054] An existing connection was forcibly closed by the remote host
01:24:25    ERROR - Automation Error: Received unexpected exception while running application

(from https://treeherder.mozilla.org/logviewer.html#?job_id=81827388&repo=try&lineNumber=2838) ?
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(arich)
The only difference in the environment setup between the vm and non-vm instances should be the presence of GPU drivers and access to the GPU on the non-vm instances.
Flags: needinfo?(rthijssen)
pgo vs opt vs debug run on the same machines, and we don't configure different instances for pgo. Please ping me again when the range is narrowed down to a day or two; looking over thousands of changesets is not very useful.
Flags: needinfo?(jmaher)
The difference in errors between comment 0 and comment 7 suggests potentially two problems: in comment 7 Marionette started, whereas in comment 0 it never did. This suggests it could be a startup crash happening at different times, but further bisection might help find the cause.
Flags: needinfo?(dburns)
Flags: needinfo?(arich)
Flags: needinfo?(mcornmesser)
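For reference, the two Winsock error codes behind the two signatures above can be told apart roughly like this (an illustrative triage helper with made-up names, not part of the harness):

```python
# Winsock error codes from the two logs in this bug (illustrative only).
WSAECONNREFUSED = 10061  # comment 0: nothing listening -> Marionette never started
WSAECONNRESET = 10054    # comment 7: connection dropped -> Firefox died after startup

def classify_socket_failure(winsock_errno):
    """Rough triage of the two 'Automation Error' signatures seen here."""
    if winsock_errno == WSAECONNREFUSED:
        return "startup failure: Marionette server never began listening"
    if winsock_errno == WSAECONNRESET:
        return "mid-run failure: connection was up, then the browser went away"
    return "unknown failure mode"
```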
Spent some time in bisect builds and try builds and:

Uplift simulations [1] on Feb 27th were showing the error we see now, e.g. in https://treeherder.mozilla.org/logviewer.html#?job_id=80447625&repo=try, but a try run with a later rev shows this is fairly intermittent:

2 failures in 230 tests -> https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c69d44946b98040c5f9c23cb1e175bb42548d5e

Also, a try run with the current aurora tree shows some of the failures, but also a lot of green runs for the same test:
-> https://treeherder.mozilla.org/#/jobs?repo=try&revision=92df3032b2e9633b2b0c5e3861378704f46daa81

And the latest aurora-as-beta uplift simulation doesn't show this error at all in https://treeherder.mozilla.org/#/jobs?repo=try&revision=68f4d254bc812ab4003b31e67be4b265f0fcf30f (ignore the other failures; only windows 7 vm opt is interesting here :)

So this could mean we either have some startup crash somewhere that didn't get noticed before, or something is not starting up as fast as we expect?


[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4319e10f0da544b50835676761b0289fafda3404&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception
I discussed comment 12 with Tomcat and we sorted out some issues in the bisection procedure. Taking another stab at it, bug 1312883 is beginning to look like a plausible candidate for this. The Try push for it backed out from Aurora is looking green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=123fa634bd19692571790b5cc0803bb14d376753

Mike, any ideas what might be going on here?
Flags: needinfo?(mconley)
I think the Try run in comment 13 is conclusive enough to officially claim causality. I pinged Mike about it on IRC and he asked for a day to look into it before formally doing the backout, which sounds reasonable enough to me.
Blocks: 1312883
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
> I discussed comment 12 with Tomcat and we sorted out some issues in the
> bisection procedure. Taking another stab at it, bug 1312883 is beginning to
> look like a plausible candidate for this. The Try push for it backed out
> from Aurora is looking green:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=123fa634bd19692571790b5cc0803bb14d376753
> 
> Mike, any ideas what might be going on here?

Yeah, I agree with Ryan. The latest try run with the backout is entirely green, so this should fix this de-facto tree-closure bug/regression.
So I've been looking at this today, and I'm having trouble getting actionable data out of this.

Do we know for certain that the parent process is crashing? If so, where are the stacks? How do I recreate this scenario locally so I can debug it? Who can answer these questions?
Flags: needinfo?(mconley) → needinfo?(ryanvm)
To close the loop from comment 17, Mike's been working with people on IRC and IRL to debug this. In the meantime, bug 1312883 was backed out from Aurora and the tests are all greening up. Leaving this bug open for the investigation so we don't get burned by this again when 55 goes to Aurora next month.
Flags: needinfo?(ryanvm)
Update:

Setting --enable-profiling didn't help, and RyanVM was also able to reproduce in a non-PGO build.

The backout has occurred, which means I've bought a month to figure this out. Joy!

I'll file a bug to request a Win 7 VM loaner.
From the pushes I did on Try today, I learned a couple things:
* While the failures *can* happen with regular opt builds without PGO enabled, PGO makes the failures a lot more frequent.
* --enable-profiling is a red herring, adding it back to the configs made no difference in frequency.
* Changing the version number back to 54.0a1 (and in effect getting NIGHTLY_BUILD set again) makes the failures go away.
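The build knobs toggled in those Try pushes would look roughly like this in a mozconfig (a hypothetical fragment to illustrate the findings; the actual automation configs differ):

```shell
# Hypothetical mozconfig fragment illustrating the knobs from the Try pushes.
mk_add_options MOZ_PGO=1            # with PGO the failures are far more frequent
ac_add_options --enable-profiling   # red herring: no effect on failure frequency
# Setting the version back to 54.0a1 causes NIGHTLY_BUILD to be defined again,
# and with NIGHTLY_BUILD set the failures go away entirely.
```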
[Tracking Requested - why for this release]: needs a fix before 55 hits aurora on 2017-04-17, or we'll be right back in the same boat again.
Tracking 55+ based on the reason in Comment 23.
The patch that mystor posted to try here[1] gets rid of a deadlock that might be responsible for this problem.

When that lands, I'm going to try to re-apply bug 1312883 and that patch to my aurora tip and push to try, and see if this addresses the issue.

[1]: https://hg.mozilla.org/try/rev/b613b42a25bad5765af95b03e9b0ef8bc7f9d44c
Bug 1346415 has now hit central.  Can this be marked as fixed in 55?
Flags: needinfo?(mconley)
Yeah, I think so. Let's re-open if uplift simulations show the bug rearing its head again.
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(mconley)
Resolution: --- → FIXED