Frequent beta Assertion failure: mTreeManager, at /builds/worker/checkouts/gecko/gfx/layers/ipc/APZInputBridgeParent.cpp:35
Categories
(Core :: Graphics, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox121 | --- | affected |
People
(Reporter: intermittent-bug-filer, Unassigned)
References
Details
(Keywords: assertion, intermittent-failure)
Filed by: smolnar [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=432097331&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VslFPZKZSlS2MpwJoXlEPw/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VslFPZKZSlS2MpwJoXlEPw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
TEST-START | /infrastructure/reftest/reftest_timeout.html
[task 2023-10-11T11:33:53.917Z] 11:33:53 INFO - TEST-SKIP | /infrastructure/reftest/reftest_timeout.html | took 0ms
[task 2023-10-11T11:33:53.930Z] 11:33:53 INFO - Using 1 child processes
[task 2023-10-11T11:33:53.960Z] 11:33:53 INFO - Setting up ssl
[task 2023-10-11T11:33:54.778Z] 11:33:54 INFO - certutil | b''
[task 2023-10-11T11:33:54.977Z] 11:33:54 INFO - certutil | b''
[task 2023-10-11T11:33:55.006Z] 11:33:55 INFO - certutil | b'\nCertificate Nickname Trust Attributes\n SSL,S/MIME,JAR/XPI\n\nweb-platform-tests CT,, \r\n'
[task 2023-10-11T11:33:55.018Z] 11:33:55 INFO - Application command: Z:\task_169702309243806\build\application\firefox\firefox.exe -marionette about:blank --wait-for-browser -profile C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt
[task 2023-10-11T11:33:55.169Z] 11:33:55 INFO - Application command: Z:\task_169702309243806\build\application\firefox\firefox.exe -marionette about:blank --wait-for-browser -profile C:\Users\task_169702309243806\AppData\Local\Temp\tmpbwmmg7l4
[task 2023-10-11T11:33:55.175Z] 11:33:55 INFO - Starting runner
[task 2023-10-11T11:34:02.753Z] 11:34:02 INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948.log
[task 2023-10-11T11:34:03.723Z] 11:34:03 INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948_gpu_pid5872.log
[task 2023-10-11T11:34:03.820Z] 11:34:03 INFO - PID 6604 | 1697024043819 Marionette INFO Marionette enabled
[task 2023-10-11T11:34:04.016Z] 11:34:04 INFO - PID 6604 | 1697024044015 Marionette INFO Listening on port 61096
[task 2023-10-11T11:34:04.829Z] 11:34:04 INFO - PID 6604 | [Parent 1220, Main Thread] WARNING: Failed to retarget HTML data delivery to the parser thread.: file /builds/worker/checkouts/gecko/parser/html/nsHtml5StreamParser.cpp:1224
[task 2023-10-11T11:34:04.839Z] 11:34:04 INFO - PID 6604 | [WARN rkv::backend::impl_safe::environment] `load_ratio()` is irrelevant for this storage backend.
[task 2023-10-11T11:34:04.982Z] 11:34:04 INFO - PID 6604 | Initializing context 000001E5F68C2000 surface 000001E5F6856350 on display 000001E5F684C100
[task 2023-10-11T11:34:04.983Z] 11:34:04 INFO - PID 6604 | [GPU 5872, Renderer] WARNING: Failed to make GL context current!: file /builds/worker/checkouts/gecko/gfx/gl/GLContextProviderEGL.cpp:473
[task 2023-10-11T11:34:04.983Z] 11:34:04 INFO - PID 6604 | EGL Error: 0x3009
[task 2023-10-11T11:34:04.984Z] 11:34:04 INFO - PID 6604 | [GPU 5872, Renderer] WARNING: GLContext::InitWithPrefix failed!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:330
[task 2023-10-11T11:34:04.985Z] 11:34:04 INFO - PID 6604 | Destroying context 000001E5F68C2000 surface 000001E5F6856350 on display 000001E5F684C100
[task 2023-10-11T11:34:04.986Z] 11:34:04 INFO - PID 6604 | [GPU 5872, Renderer] WARNING: Failed to create GLContext from PBuffer: file /builds/worker/checkouts/gecko/gfx/gl/GLContextProviderEGL.cpp:1181
[task 2023-10-11T11:34:04.986Z] 11:34:04 INFO - PID 6604 | Initializing context 000001E5F68D6000 surface 000001E5F68570D0 on display 000001E5F684C100
[task 2023-10-11T11:34:04.987Z] 11:34:04 INFO - PID 6604 | GL_VENDOR: Google Inc. (NVIDIA)
[task 2023-10-11T11:34:04.987Z] 11:34:04 INFO - PID 6604 | mVendor: NVIDIA
[task 2023-10-11T11:34:04.988Z] 11:34:04 INFO - PID 6604 | GL_RENDERER: ANGLE (NVIDIA, NVIDIA Tesla M60 Direct3D11 vs_5_0 ps_5_0, D3D11-30.0.14.7239)
[task 2023-10-11T11:34:04.988Z] 11:34:04 INFO - PID 6604 | mRenderer: Unknown
[task 2023-10-11T11:34:04.988Z] 11:34:04 INFO - PID 6604 | mIsMesa: 0
[task 2023-10-11T11:34:05.290Z] 11:34:05 INFO - PID 6604 | [Parent 1220, GMPThread] WARNING: Failed to delete GMP storage directory: file /builds/worker/checkouts/gecko/dom/media/gmp/GMPServiceParent.cpp:1863
[task 2023-10-11T11:34:05.312Z] 11:34:05 INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948_tab_pid1252.log
[task 2023-10-11T11:34:26.128Z] 11:34:26 INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:26.148Z] 11:34:26 INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:26.149Z] 11:34:26 INFO - PID 6604 | [WARN webrender::device::gl] Missing optimized shader source for gpu_cache_update
INFO - PID 6604 | [Parent 1220, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /builds/worker/workspace/obj-build/dist/include/mozilla/dom/SyncedContextInlines.h:95
[task 2023-10-11T11:34:26.260Z] 11:34:26 INFO - PID 6604 | [GPU 5872, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp:371
[task 2023-10-11T11:34:26.301Z] 11:34:26 INFO - PID 6604 | [Child 1252, Main Thread] WARNING: IPC message 'PContent::Msg_CleanupPendingLoadState' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:545
[task 2023-10-11T11:34:26.400Z] 11:34:26 INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948_tab_pid5432.log
[task 2023-10-11T11:34:39.194Z] 11:34:39 INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:39.196Z] 11:34:39 INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:39.258Z] 11:34:39 INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:39.259Z] 11:34:39 INFO - PID 6604 | Assertion failure: mTreeManager, at /builds/worker/checkouts/gecko/gfx/layers/ipc/APZInputBridgeParent.cpp:35
[task 2023-10-11T11:34:39.780Z] 11:34:39 INFO - STDOUT: Initializing stack-fixing for the first stack frame, this may take a while...
[task 2023-10-11T11:34:51.352Z] 11:34:51 INFO - TEST-START | /acid/acid2/reftest.html
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 7•10 months ago
|
||
This bug started as an intermittent one on beta-sims a month ago but in the last few days has become very frequent on beta as seen here: https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2023-10-23&endday=2023-11-22&tree=firefox-releases&failurehash=all&bug=1858436
First push that had more than one failure was this one: https://treeherder.mozilla.org/jobs?repo=mozilla-beta&resultStatus=testfailed%2Cbusted%2Cexception&revision=850029a700cfcd62c70e1153778fcac0c0accbd5&searchStr=win%2Creftest&selectedTaskRun=KEjWvlyoTTGWT3vb-88w-w.0
Comment 8•10 months ago
•
|
||
Bob, could you have a look over this one or redirect to someone else? Thank you.
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=437342206&repo=mozilla-beta&lineNumber=2668
Updated•10 months ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 10•10 months ago
|
||
Despite the gfx in the path of the failing assert, it seems like this is probably more in apz land?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•10 months ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 16•10 months ago
|
||
RE [stockwell disable-recommended]
, I don't think there's any test we can disable here. :-/
This looks like it's a fatal assertion-failure that happens at startup, if I'm reading the logs correctly. Presumably there's some sort of race condition at startup that we're now on the losing side of, due to some change that was recently made that works out differently on Nightly vs. Beta, or something along those lines?
We also can't simply remove/downgrade the assertion itself, because if it fails, we're potentially going to crash with a null-deref a bit later. (It's asserting that a pointer is non-null, and then we boldly proceed to use that pointer in other methods in the same class). So unless the failing situation is for a one-off never-actually used object or something, then the assertion is about an invariant that looks to be important to avoid further crashes.
botond: I wonder if you've had a chance to look and have any ideas here... We probably don't want to leave beta (and probably release?) in this state indefinitely.
Comment hidden (Intermittent Failures Robot) |
Comment 18•10 months ago
|
||
Per APZ/gfx sync meeting, I think this may be a failure specific to CI running on systems with a GPU, the datacenter GPUs like this one can have extremely different latency profiles because they are remote network devices (a whole rack dedicated to GPUs, separate from the regular CPU instances that talk to them), so there may be a race condition during startup where a call blocks when we don't expect it or vice versa, it may be worth forcing these tests to run in software just to see if it affects failure rate of this assert and determine if that tickled the bug, if so we may be able to modify the GPU process startup to accommodate something we didn't expect previously from D3D11 API behavior.
Comment 19•10 months ago
|
||
This feels similar to bug 1854633 and bug 1839589. These all seem to be happening on the same machines (usually win32 debug). The problem seems to be something early in the startup of the gpu process. I suspect they all have the same root cause. It seems that the infra changes around bug 1863630 might have triggered this.
Comment 20•10 months ago
|
||
Bug 1868357 seems similar, too (in that case, we failed some reftests due to unexpectedly using software-WR, after Killing GPU process due to IPC reply timeout
early-on). Comment 18 here seems suspiciously relevant to what may have happened there.
Comment 21•10 months ago
|
||
Yes, the assertion failure seems to indicate something going wrong during GPU process startup.
The assertion failure happens in code running in the GPU process during its startup. The code is triggered by the parent process sending a PGPU::InitAPZInputBridge
message in GPUProcessManager::CreateRemoteSession(). Note that the LayersId
argument sent is aRootLayerTreeId
, which is the LayersId representing the window's root layer tree (the one rendered by the parent process); a child LayersId (representing content rendered by a content process) is never sent.
The code in the GPU process looks up this root LayersId in the sIndirectLayerTrees
lookup table. This entry should have been added in CompositorBridgeParent::Initialize()
, triggered by this earlier line in CreateRemoteSession()
; the entry has its mParent field populated at the same time. Finally, mParent->mApzcTreeManager
is also populated in CompositorBridgeParent::Initialize()
. (This last assignment is conditional on CompositorOptions::UseAPZ()
, but that corresponds to the parent process check here, and as far as I can tell, we're propagating the same compositor options when creating the PCompositorBridge
actors here.)
I couldn't thus far spot a concrete code path / set of conditions that would lead to mTreeManager
accessed in this way being null, though the number of IPC protocols and messages involved here is large enough that doing an exhaustive analysis is a challenge.
The effect of a null mTreeManager
(were it to occur in a non-debug build where the assertion failure isn't itself fatal) will almost certainly be the GPU process crashing shortly, as soon as the parent process sends it any input event. So, we could consider turning this assertion into something that causes GPU process startup to fail eagerly (i.e. return null from CreateRemoteSession
). I'm not sure what the effect of this on affected CI runs would be -- maybe it would fix the failures because we'd try to re-start the GPU process, and whatever was causing the misbehaviour on the first startup doesn't happen a second time? Or maybe the GPU process crash would just show up as a failure with a different signature?
Comment 22•10 months ago
|
||
Does anyone have theories why the assertion happens only on beta branches? I think it would be another entry point where we can dig into details. I guess there should have some recent code changes behind nightly-only flags.
Comment 23•10 months ago
|
||
(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)
Does anyone have theories why the assertion happens only on beta branches? I think it would be another entry point where we can dig into details. I guess there should have some recent code changes behind nightly-only flags.
My guess is that other bugs listed in the see also of this bug have the same root cause because they all have issues with gpu process early on startup. Those bugs happen on mozilla-central. So if that guess is correct, the underlying issue is not branch-specific, but it manifests differently on beta.
Comment hidden (Intermittent Failures Robot) |
Comment 25•10 months ago
|
||
(In reply to Botond Ballo [:botond] from comment #21)
The code in the GPU process looks up this root LayersId in the
sIndirectLayerTrees
lookup table. This entry should have been added inCompositorBridgeParent::Initialize()
, triggered by this earlier line
The CompositorBridgeParent::Initialize()
runs on the compositor thread. Whereas APZInputBridgeParent::APZInputBridgeParent
runs on the main-thread. So if our IPC call processing is quite fast, then when the ctor of APZInputBridgeParent
gets called, is it possible that CompositorBridgeParent::Initialize()
hasn't yet finished?
If it's possible, just moving this lock at the top of the function would solve the race?
Comment 26•10 months ago
|
||
I pushed a try run with moving the lock, there was an assertion failure. So comment 25 isn't the case hitting the assertion. That means the root LayersId was removed from the sIndirectLayerTrees
or the mParent
got nullified for some reasons.
Comment 27•10 months ago
|
||
(In reply to Hiroyuki Ikezoe (:hiro) from comment #25)
The
CompositorBridgeParent::Initialize()
runs on the compositor thread. WhereasAPZInputBridgeParent::APZInputBridgeParent
runs on the main-thread.
Thanks for pointing this out! I did overlook that they run on different threads.
However, Initialize
is a synchronous message, so the particular issue you describe shouldn't occur.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•7 months ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Description
•