Intermittent ASAN Wd TEST-UNEXPECTED-TIMEOUT | expected OK [shutdown hang]
Categories
(Remote Protocol :: WebDriver BiDi, defect, P5)
Tracking
(Not tracked)
People
(Reporter: whimboo, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: intermittent-failure)
Certain ASAN Wd jobs show timeouts due to a delay during shutdown. It's not clear to me where this extra minute (before the test times out) is actually spent, which can be seen in the logs:
https://treeherder.mozilla.org/logviewer?job_id=481088653&repo=autoland&lineNumber=9277-9278
[task 2024-11-04T14:38:53.282Z] 14:38:53 INFO - PID 1869 | count bytes template
[task 2024-11-04T14:38:53.284Z] 14:38:53 INFO - PID 1869 | 31 16288 nsComponentManagerImpl
[task 2024-11-04T14:38:53.286Z] 14:38:53 INFO - PID 1869 | 2 288 libfontconfig.so
[task 2024-11-04T14:38:53.287Z] 14:38:53 INFO - PID 1869 | 1 9496 style::sharing::SHARING_CACHE_KEY
[task 2024-11-04T14:38:53.289Z] 14:38:53 INFO - PID 1869 | 1 4104 style::bloom::BLOOM_KEY
[task 2024-11-04T14:38:53.290Z] 14:38:53 INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.485Z] 14:38:53 INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.487Z] 14:38:53 INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.489Z] 14:38:53 INFO - PID 1869 | Suppressions used:
[task 2024-11-04T14:38:53.489Z] 14:38:53 INFO - PID 1869 | count bytes template
[task 2024-11-04T14:38:53.489Z] 14:38:53 INFO - PID 1869 | 31 16288 nsComponentManagerImpl
[task 2024-11-04T14:38:53.489Z] 14:38:53 INFO - PID 1869 | 2 288 libfontconfig.so
[task 2024-11-04T14:38:53.492Z] 14:38:53 INFO - PID 1869 | 1 9496 style::sharing::SHARING_CACHE_KEY
[task 2024-11-04T14:38:53.492Z] 14:38:53 INFO - PID 1869 | Suppressions used:
[task 2024-11-04T14:38:53.492Z] 14:38:53 INFO - PID 1869 | count bytes template
[task 2024-11-04T14:38:53.492Z] 14:38:53 INFO - PID 1869 | 31 16288 nsComponentManagerImpl
[task 2024-11-04T14:38:53.492Z] 14:38:53 INFO - PID 1869 | 2 288 libfontconfig.so
[task 2024-11-04T14:38:53.492Z] 14:38:53 INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.493Z] 14:38:53 INFO - PID 1869 | 1 4104 style::bloom::BLOOM_KEY
[task 2024-11-04T14:38:53.494Z] 14:38:53 INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:39:56.367Z] 14:39:56 INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/browsing_context/user_prompt_opened/user_prompt_opened.py | expected OK
[task 2024-11-04T14:39:56.367Z] 14:39:56 INFO - TEST-INFO took 105002ms
[task 2024-11-04T14:39:56.368Z] 14:39:56 INFO - Restarting browser for new test group
Reporter | ||
Comment 1•4 months ago
|
||
Hi Christian. Do you have an idea to figure out what actually happens between the following lines when an ASAN build is shutting down?
[task 2024-11-04T14:38:53.494Z] 14:38:53 INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:39:56.367Z] 14:39:56 INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/browsing_context/user_prompt_opened/user_prompt_opened.py | expected OK
In cases like the following test job it works just fine:
https://treeherder.mozilla.org/logviewer?job_id=481171737&repo=mozilla-central&lineNumber=257725-257728
[task 2024-11-04T23:59:10.558Z] 23:59:10 INFO - PID 23294 | 147 28072 swrast_dri.so
[task 2024-11-04T23:59:10.558Z] 23:59:10 INFO - PID 23294 | -----------------------------------------------------
[task 2024-11-04T23:59:10.655Z] 23:59:10 INFO - PID 23294 | 1730764750654 geckodriver::browser DEBUG Browser process stopped: exit status: 0
[task 2024-11-04T23:59:10.656Z] 23:59:10 INFO - PID 23294 | 1730764750654 webdriver::server DEBUG <- 200 OK {"value":null}
Thanks!
Comment 2•4 months ago
•
|
||
I don't understand the question - According to your log, the build is not shutting down, it is timing out?
Edit: Ah, this timeout is during shutdown. If these are debug builds, then our debug shutdown procedure can of course take longer on ASan, but there is nothing special from ASan itself happening there that could be debugged through ASan.
Reporter | ||
Comment 3•4 months ago
|
||
It is unclear what specifically causes that huge delay during shutdown. Marionette triggers an internal quit and the quit-application
notification is received. Nothing reports that the build is refusing to shutdown. We also already have async shutdown logs enabled which usually give helpful details in why a shutdown is delayed. But here for the ASAN builds basically nothing happens. Not a single line of output from Firefox.
As such I wonder if there is some kind of other log (maybe via MOZ_LOG) that we could enable for ASAN builds to output more details.
Comment 4•4 months ago
|
||
If there are such settings, then they are unrelated to ASan. ASan itself has nothing "extra" running at shutdown afaik, just the usual overhead from memory reads/writes.
Note that if you recently introduced a deadlock of some sort, it could show up on ASan and not show up on normal because of timing changes in ASan builds. That would still not be an ASan related bug but a general shutdown hang bug.
Reporter | ||
Comment 5•4 months ago
|
||
What are these timing changes? Is there a dedicated bug for that? These shutdown hangs started by last week as it looks like, so maybe there is an overlap?
Reporter | ||
Comment 6•3 months ago
|
||
This is most likely as well related to the same underlying issue as bug 1927416. Due to system session resume we most likely have a huge number of left over Firefox processes around, and that cause a slowdown of the whole worker, resulting in slow shutdown times as well.
Comment 7•3 months ago
|
||
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #5)
What are these timing changes?
By timing changes I simply meant that ASan builds are slower. And the slowdown is not exactly linear but depends on memory accesses, etc. so e.g. data races can manifest in different ways than in a normal build. But again, if that's causing problems, it is not ASan that's the issue.
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #6)
This is most likely as well related to the same underlying issue as bug 1927416. Due to system session resume we most likely have a huge number of left over Firefox processes around, and that cause a slowdown of the whole worker, resulting in slow shutdown times as well.
Makes sense. Each process also consumes more memory with ASan, which can increase the slow down further compared to normal builds.
Comment hidden (Intermittent Failures Robot) |
Reporter | ||
Comment 9•2 months ago
|
||
We moved wdspec tests to MacOS 14.70. Those workers are not affected and tests run more stable.
Comment hidden (Intermittent Failures Robot) |
Description
•