Closed Bug 1487243 Opened 6 years ago Closed 6 years ago

Intermittent linux/asan mn,mochitest,wdspec "TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output" (hang on start)

Categories

(Core :: Widget: Gtk, defect, P3)

All
Linux
defect

Tracking


RESOLVED FIXED
mozilla64
Tracking Status
firefox63 --- fixed
firefox64 --- fixed

People

(Reporter: gbrown, Assigned: stransky)

References

(Blocks 1 open bug)

Details

(Keywords: hang)

In bug 1414495 there are frequent reports of "application timed out after 370 seconds with no output" on linux/asan mochitests.

https://treeherder.mozilla.org/logviewer.html#?job_id=196470320&repo=autoland&lineNumber=2628

https://treeherder.mozilla.org/logviewer.html#?job_id=196404206&repo=mozilla-central&lineNumber=2631

https://treeherder.mozilla.org/logviewer.html#?job_id=196394105&repo=mozilla-inbound&lineNumber=3003

[task 2018-08-29T09:41:06.879Z] 09:41:06     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpCIzsbX.mozrunner
[task 2018-08-29T09:41:06.900Z] 09:41:06     INFO - runtests.py | Application pid: 3100
[task 2018-08-29T09:41:06.900Z] 09:41:06     INFO - TEST-INFO | started process GECKO(3100)
[task 2018-08-29T09:44:06.976Z] 09:44:06     INFO - runtests.py | Waiting for browser...
[task 2018-08-29T09:47:16.903Z] 09:47:16     INFO - Buffered messages finished
[task 2018-08-29T09:47:16.904Z] 09:47:16    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
I notice that all the Marionette debug messages are missing. Normally we would see:

[task 2018-08-29T10:28:47.228Z] 10:28:47     INFO - TEST-INFO | started process GECKO(1081)
[task 2018-08-29T10:28:49.287Z] 10:28:49     INFO - GECKO(1081) | 1535538529282	Marionette	DEBUG	Received observer notification profile-after-change
[task 2018-08-29T10:28:49.449Z] 10:28:49     INFO - GECKO(1081) | 1535538529442	Marionette	DEBUG	Received observer notification command-line-startup
[task 2018-08-29T10:28:49.451Z] 10:28:49     INFO - GECKO(1081) | 1535538529443	Marionette	DEBUG	Received observer notification nsPref:changed
[task 2018-08-29T10:28:49.452Z] 10:28:49     INFO - GECKO(1081) | 1535538529443	Marionette	DEBUG	Init aborted (running=false, enabled=true, finalUIStartup=false)
[task 2018-08-29T10:28:49.793Z] 10:28:49     INFO - GECKO(1081) | 1535538529786	Marionette	DEBUG	Received observer notification toplevel-window-ready
[task 2018-08-29T10:28:53.403Z] 10:28:53     INFO - GECKO(1081) | 1535538533394	Marionette	DEBUG	Received observer notification sessionstore-windows-restored
[task 2018-08-29T10:28:53.403Z] 10:28:53     INFO - GECKO(1081) | 1535538533394	Marionette	DEBUG	Waiting for delayed startup...
[task 2018-08-29T10:28:54.786Z] 10:28:54     INFO - GECKO(1081) | 1535538534781	Marionette	DEBUG	Waiting for startup tests...
Screenshots show the Linux desktop -- no sign of Firefox.
That would mean we hang very early during startup, or that Firefox doesn't get started at all?
Priority: -- → P2
Yes. It looks like a hang very early in startup. I am making good progress on tracking it down.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6574ad34b2bea0b1e0de4bc6c571545a11f81aaa demonstrates the hang with extra logging.

https://treeherder.mozilla.org/logviewer.html#?job_id=197418320&repo=try&lineNumber=2774

[task 2018-09-04T17:19:16.945Z] 17:19:16     INFO - TEST-INFO | started process GECKO(2717)
[task 2018-09-04T17:19:17.203Z] 17:19:17     INFO - GECKO(2717) | ZZZ XRE_mainRun calls SetWindowCreator...
[task 2018-09-04T17:19:17.204Z] 17:19:17     INFO - GECKO(2717) | ZZZ SetWindowCreator creating nsIWindowCreator...
[task 2018-09-04T17:19:17.205Z] 17:19:17     INFO - GECKO(2717) | ZZZ nsAppStartup::Init getting nsAppShell...
[task 2018-09-04T17:19:17.206Z] 17:19:17     INFO - GECKO(2717) | ZZZ nsAppShell::Init calls WakeLockListener::GetSingleton and AddWakeLockListener
[task 2018-09-04T17:19:17.207Z] 17:19:17     INFO - GECKO(2717) | ZZZ WakeLockListener::GetSingleton
[task 2018-09-04T17:19:17.207Z] 17:19:17     INFO - GECKO(2717) | ZZZ WakeLockListener::GetSingleton creating new...
[task 2018-09-04T17:22:16.954Z] 17:22:16     INFO - runtests.py | Waiting for browser...
[task 2018-09-04T17:25:27.218Z] 17:25:27     INFO - Buffered messages finished
[task 2018-09-04T17:25:27.219Z] 17:25:27    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output

It appears that we are hung on the call to dbus_bus_get() in the WakeLockListener ctor at:

https://dxr.mozilla.org/mozilla-central/rev/c2e3be6a1dd352b969a45f0b85e87674e24ad284/widget/gtk/WakeLockListener.cpp#401
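For context, here is a minimal sketch of the problematic pattern -- an eager, blocking dbus_bus_get() call in a constructor that runs during startup. This is illustrative only, not the actual WakeLockListener code; the class name and the use of DBUS_BUS_SESSION are assumptions:

  // Sketch: dbus_bus_get() connects to and registers with the bus daemon,
  // so a stalled or unreachable daemon stalls the caller -- and, when the
  // call happens in a startup-path constructor, the whole startup.
  #include <dbus/dbus.h>
  #include <cstdio>

  class EagerDBusClient {  // hypothetical stand-in for WakeLockListener
   public:
    EagerDBusClient() {
      dbus_error_init(&mError);
      // May block indefinitely if the bus daemon misbehaves.
      mConnection = dbus_bus_get(DBUS_BUS_SESSION, &mError);
      if (!mConnection) {
        fprintf(stderr, "DBus connection failed: %s\n",
                dbus_error_is_set(&mError) ? mError.message : "unknown");
      }
    }
    ~EagerDBusClient() {
      if (mConnection) {
        // dbus_bus_get() returns a shared connection: unref, don't close.
        dbus_connection_unref(mConnection);
      }
      dbus_error_free(&mError);
    }

   private:
    DBusConnection* mConnection = nullptr;
    DBusError mError;
  };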
Component: General → Widget: Gtk
Product: Testing → Core
Version: Version 3 → unspecified
I wonder if we had a Docker image upgrade around the time the hang started.
(In reply to Henrik Skupin (:whimboo) from comment #6)
> I wonder if we had a Docker image upgrade around the time the hang started.

Unfortunately, I'm not sure when the hang really started.
Several test suites are failing due to this hang. It would be great to get this investigated.
Summary: Intermittent linux/asan mochitest | application timed out after 370 seconds with no output (hang on start) → Intermittent linux/asan mochitest,wdspec | application timed out after 370 seconds with no output (hang on start)
Summary: Intermittent linux/asan mochitest,wdspec | application timed out after 370 seconds with no output (hang on start) → Intermittent linux/asan mn,mochitest,wdspec | application timed out after 370 seconds with no output (hang on start)
This hang seems to occur more and more often during startup, and now a lot of Marionette unit tests are also affected.

Geoff, is there someone else we could try to ask to make some progress on this bug? Maybe Jim isn't around at the moment?
Flags: needinfo?(gbrown)
Thanks Henrik. Yes, I think you are right and Jim isn't available right now.

:selena - Can you help us find someone to look into this? I see this Firefox startup hang frequently on linux/asan tests, can reproduce it on try, and have provided a stack (comment 8).
Flags: needinfo?(gbrown) → needinfo?(sdeckelmann)
Please also note all the bugs in the blocking list, which have their own failure counts. As such, this problem is the cause of one of the top intermittent failures.
Jim should be back today. Putting this back in his court.
Flags: needinfo?(sdeckelmann)
I wonder if there is a correlation between this bug and bug 1382162 (no sessionstore-windows-restored notification received). At least we now also see an increase of the other bug on Linux ASAN rather than Windows. I will mark it as being blocked by this bug for now.
Blocks: 1382162
Note that this is a hang of Firefox, which should be reflected in the severity of the bug.
Severity: normal → critical
Keywords: hang
(In reply to Selena Deckelmann :selenamarie :selena use ni? pronoun: she from comment #13)
> Jim should be back today. Putting this back in his court.

If Jim doesn't have the time, can we please get someone else to have a look at this? It's causing a couple of top OrangeFactor failures. Thanks.
Flags: needinfo?(sdeckelmann)
Jim -- please have a look and reassign if this should be in someone else's court. Thanks!
Flags: needinfo?(sdeckelmann)
Summary: Intermittent linux/asan mn,mochitest,wdspec | application timed out after 370 seconds with no output (hang on start) → Intermittent linux/asan mn,mochitest,wdspec | automation.py | application timed out after 370 seconds with no output (hang on start)
OS: Unspecified → Linux
Hardware: Unspecified → All
Something to do with dbus connection initialization.
Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Hey Martin, curious if you all have any ideas here? We're debugging some sort of dbus init error that causes a hang in our automated tests.
Flags: needinfo?(jmathies) → needinfo?(stransky)
Flags: needinfo?(jmathies)
(Adjusting bug summary so it gets offered as a suggestion, per: https://mozilla.logbot.info/treeherder/20180928#c15387769)
Summary: Intermittent linux/asan mn,mochitest,wdspec | automation.py | application timed out after 370 seconds with no output (hang on start) → Intermittent linux/asan mn,mochitest,wdspec "TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output" (hang on start)
(In reply to Jim Mathies [:jimm] from comment #19)
> Hey Martin, curious if you all have any ideas here? We're debugging some
> sort of dbus init error that causes a hang in our automated tests.

According to the documentation and some reports, dbus_bus_get() can hang if there's a problem with the DBus connection. It may be related to the actual DBus version on the test boxes, wrong permissions on the DBus socket, or something else.

I think we should postpone the dbus_bus_get() call until we really need it, at WakeLockListener::Callback() - that will tie the dbus_bus_get() call to the code that actually uses it, make the bug clearer, and avoid blocking the whole browser startup.

Filed as Bug 1495404.
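A hedged sketch of the deferral described above (the actual fix lives in bug 1495404 and is not reproduced here; the class and method shapes below are assumptions):

  #include <dbus/dbus.h>

  class LazyDBusClient {  // hypothetical stand-in for WakeLockListener
   public:
    // Called from the wake-lock path, e.g. when video playback wants to
    // inhibit the screen saver. A DBus hang here no longer blocks startup;
    // it surfaces where the connection is actually used.
    void Callback() {
      if (!EnsureConnection()) {
        return;  // Bus unreachable: fail the wake-lock op, not startup.
      }
      // ... build and send the inhibit/uninhibit message over mConnection ...
    }

   private:
    // Lazily acquire the shared session-bus connection on first use.
    bool EnsureConnection() {
      if (mConnection) {
        return true;
      }
      DBusError error;
      dbus_error_init(&error);
      mConnection = dbus_bus_get(DBUS_BUS_SESSION, &error);
      dbus_error_free(&error);
      return mConnection != nullptr;
    }

    DBusConnection* mConnection = nullptr;
  };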
Flags: needinfo?(stransky)
Two thoughts:

1. This seems to happen only on ASan builds.  Do we also see the dbus hang on non-ASan Linux, or is there maybe something important about ASan?

2. Is there any way to get the part of the stack within libdbus?  (There probably aren't a lot of affordances within the test harness for this, and I assume we don't install debug packages on the test VMs, but it might be possible to do something with gcore(1) and exporting the core dump as an artifact.)
This has caused 30 failures for bug 1444600 on Linux 64, most on ASan.
The last failure of this type after the patch for bug 1495404 landed was the following one, from early afternoon (European time) yesterday.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a91bad5598839b3de0560cd34d5013e9bf69da3f&selectedJob=202635280

So this looks great now. Thanks a lot Martin for this quick patch!

Martin, I assume we need a new bug to handle the remaining hang, which has only been moved to a later time, right?

Can we close this bug as fixed now?
Flags: needinfo?(stransky)
Flags: needinfo?(jwalden+bmo)
Flags: needinfo?(jmathies)
(In reply to Henrik Skupin (:whimboo) from comment #27)
> The last failure of this type after the patch for bug 1495404 landed was
> the following one, from early afternoon (European time) yesterday.
>
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a91bad5598839b3de0560cd34d5013e9bf69da3f&selectedJob=202635280
>
> So this looks great now. Thanks a lot Martin for this quick patch!
>
> Martin, I assume we need a new bug to handle the remaining hang, which has
> only been moved to a later time, right?

It was moved to the time when we actually use the DBus connection - when video is played and the screen saver is disabled. In that context we may expect timeouts/hangs on the test boxes.
Flags: needinfo?(stransky)
Ok, but I have the feeling that it will cause a different summary to show up in Treeherder, given that this will most likely happen while a test is running. As such, new bugs will be filed.

I would say let's keep this bug open until the next OrangeFactor bug comments come in. Then we can see if we can close it.
Priority: P2 → P3
I think we can take it for granted that this startup hang for ASan builds is gone now. There have been no more failures on trunk since the patch landed, where before it failed multiple times a day. Thanks again Martin.

Please note that the patch has already been uplifted to mozilla-beta.
Assignee: nobody → stransky
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64