Closed Bug 1377030 Opened 7 years ago Closed 5 years ago

Intermittent test_crash.py TestCrashInSetUp.test_crash_in_setup | AssertionError: "Process crashed" does not match "Process killed because the connection to Marionette server is lost

Categories

(Testing :: Marionette Client and Harness, defect, P3)

Version: Unspecified
Platform: Windows

Tracking

(firefox-esr60 wontfix)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

This is dependent on bug 1376795.
Depends on: 1376795
Whiteboard: [stockwell unknown]
This bug is still failing often, and I see bug 1376795 has recent progress, so possibly we can revisit this soon.
I cannot reproduce it locally, whatever I try. So I'm now figuring out how to use macOS one-click loaners, if that is possible yet.
No traction in bug 1395504; I pinged again there.
The signature has been changed. Updating summary.
Summary: Intermittent test_crash.py TestCrashInSetUp.test_crash_in_setup | AssertionError: 0 != 1 → Intermittent test_crash.py TestCrashInSetUp.test_crash_in_setup | AssertionError: "Process crashed" does not match "Process killed because the connection to Marionette server is lost
Priority: -- → P3
Interesting that this is turning more and more into a Windows failure. Not sure what's happening here, especially since even the debug builds don't offer much helpful data in the logs:

https://treeherder.mozilla.org/logviewer.html#?job_id=173004187&repo=mozilla-inbound&lineNumber=47885-47894

> 10:51:07     INFO -  1523789467597	Marionette	TRACE	4 -> [0,10,"WebDriver:ExecuteScript",{"scriptTimeout":null,"newSandbox":true,"args":[],"filename":"Z:\\task_1523787255\\build\\tes ... ypes.cast(zero, ctypes.PointerType(ctypes.int32_t));\n              var crash = badptr.contents;","sandbox":null,"line":104}]
> 10:53:18    ERROR -  TEST-UNEXPECTED-ERROR | testing\marionette\harness\marionette_harness\tests\unit\test_crash.py TestCrashInSetUp.test_crash_in_setup | AssertionError: "Process crashed" does not match "Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 10s)"

The crashing code gets sent, and we do not see any result being returned by the Marionette server, which is expected. But I wonder why we see the 10s timeout coming from `execute_script`. The only place where we raise "Connection timed out after" is in `TcpTransport.receive()`, which uses `socket_timeout` for the timeout decision. So somewhere it is getting set to 10s, but nowhere in our code is this done. I'm kinda baffled.
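For reference, a minimal sketch of how such a socket-level timeout can surface (hypothetical class name and simplified logic, not the actual marionette_driver code):

import socket

class SketchTransport(object):
    # Simplified stand-in for TcpTransport; illustration only.

    def __init__(self, host, port, socket_timeout=60.0):
        self.socket_timeout = socket_timeout
        self.sock = socket.create_connection((host, port))
        # recv() raises socket.timeout once this many seconds pass
        # without any data arriving from the server.
        self.sock.settimeout(self.socket_timeout)

    def receive(self, size=4096):
        try:
            return self.sock.recv(size)
        except socket.timeout:
            # Whatever socket_timeout holds at this point ends up in the
            # error message, which is why a surprising "10s" can appear
            # even if no test set it explicitly.
            raise socket.timeout("Connection timed out after {}s".format(
                self.socket_timeout))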

The first failure on Windows was on January 25th, 2018, so already a while ago:

https://treeherder.mozilla.org/logviewer.html#?job_id=158425866&repo=mozilla-central&lineNumber=36102
There have been 33 failures in the last 7 days.
Failures occur mostly on windows 7 / pgo.
There are also some failures on: OS X 10.10 / debug, windows10-64 / pgo, and windows7-32-nightly / opt.

Here is a recent log file and a snippet with the failure:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=176081103&lineNumber=36734

e://gre/modules/AppConstants.jsm\");\n                return AppConstants.MOZ_CRASHREPORTER;","sandbox":"default","line":34}]
02:51:23     INFO -  1524883883062	Marionette	TRACE	2 <- [1,7,null,{"value":true}]
02:51:23     INFO -  1524883883063	Marionette	TRACE	2 -> [0,8,"Marionette:SetContext",{"value":"content"}]
02:51:23     INFO -  1524883883064	Marionette	TRACE	2 <- [1,8,null,{}]
02:51:23     INFO -  1524883883065	Marionette	TRACE	2 -> [0,9,"Marionette:SetContext",{"value":"chrome"}]
02:51:23     INFO -  1524883883066	Marionette	TRACE	2 <- [1,9,null,{}]
02:51:23     INFO -  1524883883067	Marionette	TRACE	2 -> [0,10,"WebDriver:ExecuteScript",{"scriptTimeout":null,"newSandbox":true,"args":[],"filename":"Z:\\task_1524883121\\build\\tes ... ypes.cast(zero, ctypes.PointerType(ctypes.int32_t));\n              var crash = badptr.contents;","sandbox":null,"line":104}]
02:53:34     INFO -  [Child 4436, Chrome_ChildThread] WARNING: pipe error: 232: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 513
02:53:34    ERROR -  TEST-UNEXPECTED-ERROR | testing\marionette\harness\marionette_harness\tests\unit\test_crash.py TestCrashInSetUp.test_crash_in_setup | AssertionError: "Process crashed" does not match "Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 10s)"
02:53:34     INFO -  Traceback (most recent call last):
02:53:34     INFO -    File "Z:\task_1524883121\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 140, in run
02:53:34     INFO -      self.setUp()
02:53:34     INFO -    File "Z:\task_1524883121\build\tests\marionette\tests\testing\marionette\harness\marionette_harness\tests\unit\test_crash.py", line 166, in setUp
02:53:34     INFO -      self.crash, chrome=True)
02:53:34     INFO -  TEST-INFO took 131125ms
Flags: needinfo?(hskupin)
Whiteboard: [stockwell unknown] → [stockwell needswork]
The number of failures has dropped a lot. For Windows there is also a shutdown hang, which I would like to investigate once the failures on OS X have been fixed. This should happen soon.
All OS X crashes are gone now, so this is a Windows-only problem. I will look into it by next week.
OS: Unspecified → Windows
There have been no more failures on integration branches since May 26th. Let's see how this evolves during this week. Maybe a patch landed that keeps Firefox from hanging on Windows after a chrome crash. Gabriele, are you aware of something? If not, I will just wait a few days and re-check next Monday.
Flags: needinfo?(hskupin) → needinfo?(gsvelto)
Nothing that I'm aware of.
Flags: needinfo?(gsvelto)
This is all happening due to shutdown hangs of Firefox, which cannot easily be investigated. The last detected hang on trunk was actually about a month ago:

https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-06-17&endday=2018-10-15&tree=trunk&bug=1377030

I'm closing this bug as WFM.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
No longer blocks: 1504482
So this failure started to happen again about a month ago. It's only happening on Windows, where we seem to have a delayed process shutdown for a parent process crash. The test itself uses a 10s timeout, which I expected to be enough. But that doesn't seem to be the case, or opt/pgo builds of Firefox have really regressed in shutdown times.

Gabriele, are you aware of any change in that area? Also, what would be a good timeout value for a test? Should it be larger than 10s?
Flags: needinfo?(gsvelto)
We kill content processes after they've been hung for 5 seconds, so 10 seconds should be enough. I did encounter cases on try where shutdown was taking a lot longer, but those seemed to be related to bug 1498942, which has been fixed. I wonder if that's related...
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #89)
> We kill content processes after they've been hung for 5 seconds, so 10
> seconds should be enough. I did encounter cases on try where shutdown was
> taking a lot longer, but those seemed to be related to bug 1498942, which
> has been fixed. I wonder if that's related...

Please note that this bug is about crashes of the parent process, not any content process. Also, given that it is still happening, it shouldn't be related to the other bug. Is there any kind of extra logging we could enable for those crash tests to help investigate shutdown issues caused by crashes?
Flags: needinfo?(gsvelto)
(In reply to Henrik Skupin (:whimboo) from comment #91)
> Please note that this bug is about crashes of the parent process, not any
> content process. Also, given that it is still happening, it shouldn't be
> related to the other bug.

Sorry, I must have mixed this up with another bug, I'll have a look again and report back.

> Is there any kind of extra logging we could enable for those crash tests
> to help investigate shutdown issues caused by crashes?

Nothing comes to mind. I debugged similar issues in the past; they were usually complex races between Gecko and the test harness, and to detect them I had to reproduce the issue with plenty of manually added debugging statements. Leaving the NI so I don't forget.
So maybe it really needs more than 10s to shut down after the crash? We could extend that timeout a little, but I would then be interested in what the maximum value should actually be. The default of 70s is too long for those crash tests.
I have to test how long the crash analysis takes to run in one of those tests to get a good idea of an upper bound.
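To illustrate the kind of bounded wait being discussed, here is a minimal sketch (a hypothetical helper, not the actual harness code); timing this call around the crash-triggering step would also give the upper bound mentioned above:

import time

def wait_for_exit(proc, timeout=10.0, poll_interval=0.1):
    # `proc` is anything with a subprocess.Popen-style poll() method that
    # returns the exit code once the process has quit, or None while it
    # is still running.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode
        time.sleep(poll_interval)
    raise RuntimeError("process did not exit within {}s".format(timeout))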
I looked at some of the failures, and here are some interesting cases:

1) https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216076403&repo=autoland&lineNumber=38835 (pgo headless)

22:17:16     INFO -  1544307436956	Marionette	DEBUG	2 -> [0,10,"WebDriver:Navigate",{"url":"about:crashparent"}]
22:17:16     INFO -  1544307436963	Marionette	TRACE	[2147483649] Received DOM event beforeunload for about:blank
22:17:16     INFO -  1544307436975	Marionette	TRACE	[2147483649] Received DOM event pagehide for about:blank
22:17:16     INFO -  1544307436975	Marionette	TRACE	[2147483649] Received DOM event unload for about:blank
22:17:16     INFO -  1544307436976	Marionette	TRACE	[2147483649] Received observer notification outer-window-destroyed
22:17:16     INFO -  1544307436982	Marionette	TRACE	[24] Frame script loaded
22:17:16     INFO -  1544307436983	Marionette	TRACE	[24] Frame script registered
22:18:37     INFO -  [Child 1740, Chrome_ChildThrea

After the frame script got loaded in the content process, it takes 1:20 min until the next child log output. Not sure why we are even waiting that long, given that the timeout should be 10s. Maybe something is wrong with setting this specific timeout here and it's waiting the default 70s instead. Either way, it marks a shutdown hang in Firefox.

2) https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219021882&repo=mozilla-central&lineNumber=60155 (debug)

23:33:34     INFO -  1545953614379	Marionette	TRACE	[12] Frame script loaded
23:33:34     INFO -  1545953614379	Marionette	TRACE	[12] Frame script registered
23:33:34     INFO -  Hit MOZ_CRASH(Crash via about:crashparent) at z:/build/build/src/docshell/base/nsAboutRedirector.cpp:40
[..]
23:33:34     INFO -  #55: RtlInitializeExceptionChain[C:\windows\SYSTEM32\ntdll.dll +0x637c8]
23:33:40     INFO -  --DOCSHELL 01468000 == 0 [pid = 4516] [id = {370ba78c-56db-45ae-8946-ec964713803e}]
23:33:48     INFO -  --DOMWINDOW == 11 (07FC5000) [pid = 5648] [serial = 8] [outer = 00000000] [url = about:blank]
23:33:48     INFO -  --DOMWINDOW == 10 (07FC0000) [pid = 5648] [serial = 6] [outer = 00000000] [url = about:blank]
23:33:48     INFO -  --DOMWINDOW == 9 (07FB8000) [pid = 5648] [serial = 5] [outer = 00000000] [url = about:blank]
23:33:48     INFO -  --DOMWINDOW == 8 (07FC4800) [pid = 5648] [serial = 7] [outer = 00000000] [url = about:blank]
23:33:50     INFO -  --DOMWINDOW == 2 (01469400) [pid = 4516] [serial = 2] [outer = 00000000] [url = about:blank]
23:33:50     INFO -  --DOMWINDOW == 1 (01470440) [pid = 4516] [serial = 1] [outer = 00000000] [url = about:blank]
23:33:54     INFO -  --DOMWINDOW == 0 (01461000) [pid = 4516] [serial = 3] [outer = 00000000] [url = about:blank]
23:34:32     INFO -  [Child 4516, Main Thread] WARNING: 'NS_FAILED(rv)', file z:/build/build/src/xpcom/base/MemoryTelemetry.cpp, line 276
23:34:33     INFO -  [Child 5648, Main Thread] WARNING: 'NS_FAILED(rv)', file z:/build/build/src/xpcom/base/MemoryTelemetry.cpp, line 276
23:34:55    ERROR -  TEST-UNEXPECTED-ERROR | testing\marionette\harness\marionette_harness\tests\unit\test_crash.py TestCrashInSetUp.test_crash_in_setup | AssertionError: "Process crashed" does not match "Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 10s)"

After the triggered MOZ_CRASH it also takes a long time until we fail. Again, not sure why we don't fail after only 10s.

Maybe something blocks the process group from shutting down while waiting for the child processes to stop?
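One way to check that hypothesis locally would be a small diagnostic along these lines (a sketch assuming the psutil package is available; `pid` would be the Firefox parent process id known to the harness):

import psutil

def dump_process_tree(pid):
    # Print the parent and any children that are still alive, to see
    # which processes keep the group from shutting down.
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        print("parent {} is already gone".format(pid))
        return
    print(parent.pid, parent.name(), parent.status())
    for child in parent.children(recursive=True):
        print("  child:", child.pid, child.name(), child.status())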

I've finally found some time to investigate this, and I'm trying to reproduce the problem locally. One issue I'm having is that I have been unable to force mach to print timestamps for all the log entries. E.g., if I run the tests with:

./mach marionette-test -v -v --gecko-log - --headless testing/marionette/harness/marionette_harness/tests/unit/test_crash.py

I get the Marionette output prefixed with timestamps but the Gecko output without, making it hard to gauge how long a certain step took. The output looks like this:

0:40.56 TEST_START: testing\marionette\harness\marionette_harness\tests\unit\test_crash.py TestCrashInTearDown.test_crash_in_teardown
[...]
1548322922884 Marionette TRACE [12] Frame script loaded
1548322922884 Marionette TRACE [12] Frame script registered
Hit MOZ_CRASH(Crash via about:crashparent) at c:/Users/gsvelto/projects/mozilla-central.x64/docshell/base/nsAboutRedirector.cpp:40

Henrik, is there a way to prefix all the output with timestamps like we do in automation?

Flags: needinfo?(gsvelto) → needinfo?(hskupin)

You specifically mean `Hit MOZ_CRASH...`? That part doesn't come from Marionette or mozbase. It's logged by Firefox, which might be configurable to prefix timestamps when using `export MOZ_LOG=timestamp,...`?

Flags: needinfo?(hskupin)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #100)

> You specifically mean `Hit MOZ_CRASH...`? That part doesn't come from Marionette or mozbase. It's logged by Firefox, which might be configurable to prefix timestamps when using `export MOZ_LOG=timestamp,...`?

Yeah, but it's not going through MOZ_LOG unfortunately; we have far too many ways to print stuff from Gecko. I found an external solution by piping all the output from mach into the `ts -s` command, which prints 00:00:00-based timestamps in front of everything, so it's easy to gauge relative durations.
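For anyone without moreutils installed, a rough Python stand-in for `ts -s` (hypothetical script name; it reads stdin and prefixes each line with the elapsed time since start):

import sys
import time

# timestamp_lines.py -- pipe output through this to get relative timestamps,
# e.g.: ./mach marionette-test ... 2>&1 | python timestamp_lines.py
start = time.monotonic()
for line in sys.stdin:
    elapsed = time.monotonic() - start
    minutes, seconds = divmod(elapsed, 60)
    hours, minutes = divmod(int(minutes), 60)
    sys.stdout.write("{:02d}:{:02d}:{:05.2f} {}".format(
        hours, minutes, seconds, line))
    sys.stdout.flush()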

BTW, while testing I've already encountered one instance of this issue locally, so there's hope I'll be able to figure it out without retriggering stuff dozens of times on try.

That sounds great. Thanks Gabriele!

Gabriele, would you mind filing a bug in the crash reporter component so we can make all intermittent bugs depend on it? There are some more that I would like to reference via the dependency. Thanks.

Flags: needinfo?(gsvelto)

Sure, will do.

Done. I added another bug I remembered, but if there are more, feel free to add them so we have them all in one place.

Flags: needinfo?(gsvelto)
No longer blocks: 1523583
Depends on: 1523583

The last failure on central is from 3 months ago, and so far only the esr60 branch still seems to be affected. Given that it might not get a fix there, and I don't know which change actually fixed that particular bug on Windows, I'm going to mark this as WFM.

Status: REOPENED → RESOLVED
Closed: 6 years ago → 5 years ago
Resolution: --- → WORKSFORME
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing