Closed Bug 1929101 Opened 4 months ago Closed 2 months ago

Intermittent ASAN Wd TEST-UNEXPECTED-TIMEOUT | expected OK [shutdown hang]

Categories

(Remote Protocol :: WebDriver BiDi, defect, P5)

Default
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Certain ASAN Wd jobs show timeouts due to a delay during shutdown. It's not clear to me where this extra minute (before the test times out) is actually spent, which can be seen in the logs:

https://treeherder.mozilla.org/logviewer?job_id=481088653&repo=autoland&lineNumber=9277-9278

[task 2024-11-04T14:38:53.282Z] 14:38:53     INFO - PID 1869 |   count      bytes template
[task 2024-11-04T14:38:53.284Z] 14:38:53     INFO - PID 1869 |      31      16288 nsComponentManagerImpl
[task 2024-11-04T14:38:53.286Z] 14:38:53     INFO - PID 1869 |       2        288 libfontconfig.so
[task 2024-11-04T14:38:53.287Z] 14:38:53     INFO - PID 1869 |       1       9496 style::sharing::SHARING_CACHE_KEY
[task 2024-11-04T14:38:53.289Z] 14:38:53     INFO - PID 1869 |       1       4104 style::bloom::BLOOM_KEY
[task 2024-11-04T14:38:53.290Z] 14:38:53     INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.485Z] 14:38:53     INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.487Z] 14:38:53     INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.489Z] 14:38:53     INFO - PID 1869 | Suppressions used:
[task 2024-11-04T14:38:53.489Z] 14:38:53     INFO - PID 1869 |   count      bytes template
[task 2024-11-04T14:38:53.489Z] 14:38:53     INFO - PID 1869 |      31      16288 nsComponentManagerImpl
[task 2024-11-04T14:38:53.489Z] 14:38:53     INFO - PID 1869 |       2        288 libfontconfig.so
[task 2024-11-04T14:38:53.492Z] 14:38:53     INFO - PID 1869 |       1       9496 style::sharing::SHARING_CACHE_KEY
[task 2024-11-04T14:38:53.492Z] 14:38:53     INFO - PID 1869 | Suppressions used:
[task 2024-11-04T14:38:53.492Z] 14:38:53     INFO - PID 1869 |   count      bytes template
[task 2024-11-04T14:38:53.492Z] 14:38:53     INFO - PID 1869 |      31      16288 nsComponentManagerImpl
[task 2024-11-04T14:38:53.492Z] 14:38:53     INFO - PID 1869 |       2        288 libfontconfig.so
[task 2024-11-04T14:38:53.492Z] 14:38:53     INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:38:53.493Z] 14:38:53     INFO - PID 1869 |       1       4104 style::bloom::BLOOM_KEY
[task 2024-11-04T14:38:53.494Z] 14:38:53     INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:39:56.367Z] 14:39:56     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/browsing_context/user_prompt_opened/user_prompt_opened.py | expected OK
[task 2024-11-04T14:39:56.367Z] 14:39:56     INFO - TEST-INFO took 105002ms
[task 2024-11-04T14:39:56.368Z] 14:39:56     INFO - Restarting browser for new test group
Blocks: 1890217

Hi Christian. Do you have an idea to figure out what actually happens between the following lines when an ASAN build is shutting down?

[task 2024-11-04T14:38:53.494Z] 14:38:53     INFO - PID 1869 | -----------------------------------------------------
[task 2024-11-04T14:39:56.367Z] 14:39:56     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/bidi/browsing_context/user_prompt_opened/user_prompt_opened.py | expected OK

In cases like the following test job it works just fine:
https://treeherder.mozilla.org/logviewer?job_id=481171737&repo=mozilla-central&lineNumber=257725-257728

[task 2024-11-04T23:59:10.558Z] 23:59:10     INFO - PID 23294 |     147      28072 swrast_dri.so
[task 2024-11-04T23:59:10.558Z] 23:59:10     INFO - PID 23294 | -----------------------------------------------------
[task 2024-11-04T23:59:10.655Z] 23:59:10     INFO - PID 23294 | 1730764750654	geckodriver::browser	DEBUG	Browser process stopped: exit status: 0
[task 2024-11-04T23:59:10.656Z] 23:59:10     INFO - PID 23294 | 1730764750654	webdriver::server	DEBUG	<- 200 OK {"value":null}

Thanks!

Flags: needinfo?(choller)

I don't understand the question - According to your log, the build is not shutting down, it is timing out?

Edit: Ah, this timeout is during shutdown. If these are debug builds, then our debug shutdown procedure can of course take longer on ASan, but there is nothing special from ASan itself happening there that could be debugged through ASan.

Flags: needinfo?(choller)

It is unclear what specifically causes that huge delay during shutdown. Marionette triggers an internal quit and the quit-application notification is received. Nothing reports that the build is refusing to shutdown. We also already have async shutdown logs enabled which usually give helpful details in why a shutdown is delayed. But here for the ASAN builds basically nothing happens. Not a single line of output from Firefox.

As such I wonder if there is some kind of other log (maybe via MOZ_LOG) that we could enable for ASAN builds to output more details.

If there are such settings, then they are unrelated to ASan. ASan itself has nothing "extra" running at shutdown afaik, just the usual overhead from memory reads/writes.

Note that if you recently introduced a deadlock of some sort, it could show up on ASan and not show up on normal because of timing changes in ASan builds. That would still not be an ASan related bug but a general shutdown hang bug.

What are these timing changes? Is there a dedicated bug for that? These shutdown hangs started by last week as it looks like, so maybe there is an overlap?

This is most likely as well related to the same underlying issue as bug 1927416. Due to system session resume we most likely have a huge number of left over Firefox processes around, and that cause a slowdown of the whole worker, resulting in slow shutdown times as well.

See Also: → 1927416

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #5)

What are these timing changes?

By timing changes I simply meant that ASan builds are slower. And the slowdown is not exactly linear but depends on memory accesses, etc. so e.g. data races can manifest in different ways than in a normal build. But again, if that's causing problems, it is not ASan that's the issue.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #6)

This is most likely as well related to the same underlying issue as bug 1927416. Due to system session resume we most likely have a huge number of left over Firefox processes around, and that cause a slowdown of the whole worker, resulting in slow shutdown times as well.

Makes sense. Each process also consumes more memory with ASan, which can increase the slow down further compared to normal builds.

We moved wdspec tests to MacOS 14.70. Those workers are not affected and tests run more stable.

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.