Open Bug 1858436 Opened 1 year ago Updated 1 month ago

Frequent beta Assertion failure: mTreeManager, at /builds/worker/checkouts/gecko/gfx/layers/ipc/APZInputBridgeParent.cpp:35

Categories

(Core :: Graphics, defect)

defect

Tracking

()

Tracking Status
firefox121 --- affected

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: assertion, intermittent-failure)

Filed by: smolnar [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=432097331&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VslFPZKZSlS2MpwJoXlEPw/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VslFPZKZSlS2MpwJoXlEPw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


TEST-START | /infrastructure/reftest/reftest_timeout.html
[task 2023-10-11T11:33:53.917Z] 11:33:53     INFO - TEST-SKIP | /infrastructure/reftest/reftest_timeout.html | took 0ms
[task 2023-10-11T11:33:53.930Z] 11:33:53     INFO - Using 1 child processes
[task 2023-10-11T11:33:53.960Z] 11:33:53     INFO - Setting up ssl
[task 2023-10-11T11:33:54.778Z] 11:33:54     INFO - certutil | b''
[task 2023-10-11T11:33:54.977Z] 11:33:54     INFO - certutil | b''
[task 2023-10-11T11:33:55.006Z] 11:33:55     INFO - certutil | b'\nCertificate Nickname                                         Trust Attributes\n                                                             SSL,S/MIME,JAR/XPI\n\nweb-platform-tests                                           CT,, \r\n'
[task 2023-10-11T11:33:55.018Z] 11:33:55     INFO - Application command: Z:\task_169702309243806\build\application\firefox\firefox.exe -marionette about:blank --wait-for-browser -profile C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt
[task 2023-10-11T11:33:55.169Z] 11:33:55     INFO - Application command: Z:\task_169702309243806\build\application\firefox\firefox.exe -marionette about:blank --wait-for-browser -profile C:\Users\task_169702309243806\AppData\Local\Temp\tmpbwmmg7l4
[task 2023-10-11T11:33:55.175Z] 11:33:55     INFO - Starting runner
[task 2023-10-11T11:34:02.753Z] 11:34:02     INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948.log
[task 2023-10-11T11:34:03.723Z] 11:34:03     INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948_gpu_pid5872.log
[task 2023-10-11T11:34:03.820Z] 11:34:03     INFO - PID 6604 | 1697024043819	Marionette	INFO	Marionette enabled
[task 2023-10-11T11:34:04.016Z] 11:34:04     INFO - PID 6604 | 1697024044015	Marionette	INFO	Listening on port 61096
[task 2023-10-11T11:34:04.829Z] 11:34:04     INFO - PID 6604 | [Parent 1220, Main Thread] WARNING: Failed to retarget HTML data delivery to the parser thread.: file /builds/worker/checkouts/gecko/parser/html/nsHtml5StreamParser.cpp:1224
[task 2023-10-11T11:34:04.839Z] 11:34:04     INFO - PID 6604 | [WARN  rkv::backend::impl_safe::environment] `load_ratio()` is irrelevant for this storage backend.
[task 2023-10-11T11:34:04.982Z] 11:34:04     INFO - PID 6604 | Initializing context 000001E5F68C2000 surface 000001E5F6856350 on display 000001E5F684C100
[task 2023-10-11T11:34:04.983Z] 11:34:04     INFO - PID 6604 | [GPU 5872, Renderer] WARNING: Failed to make GL context current!: file /builds/worker/checkouts/gecko/gfx/gl/GLContextProviderEGL.cpp:473
[task 2023-10-11T11:34:04.983Z] 11:34:04     INFO - PID 6604 | EGL Error: 0x3009
[task 2023-10-11T11:34:04.984Z] 11:34:04     INFO - PID 6604 | [GPU 5872, Renderer] WARNING: GLContext::InitWithPrefix failed!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:330
[task 2023-10-11T11:34:04.985Z] 11:34:04     INFO - PID 6604 | Destroying context 000001E5F68C2000 surface 000001E5F6856350 on display 000001E5F684C100
[task 2023-10-11T11:34:04.986Z] 11:34:04     INFO - PID 6604 | [GPU 5872, Renderer] WARNING: Failed to create GLContext from PBuffer: file /builds/worker/checkouts/gecko/gfx/gl/GLContextProviderEGL.cpp:1181
[task 2023-10-11T11:34:04.986Z] 11:34:04     INFO - PID 6604 | Initializing context 000001E5F68D6000 surface 000001E5F68570D0 on display 000001E5F684C100
[task 2023-10-11T11:34:04.987Z] 11:34:04     INFO - PID 6604 | GL_VENDOR: Google Inc. (NVIDIA)
[task 2023-10-11T11:34:04.987Z] 11:34:04     INFO - PID 6604 | mVendor: NVIDIA
[task 2023-10-11T11:34:04.988Z] 11:34:04     INFO - PID 6604 | GL_RENDERER: ANGLE (NVIDIA, NVIDIA Tesla M60 Direct3D11 vs_5_0 ps_5_0, D3D11-30.0.14.7239)
[task 2023-10-11T11:34:04.988Z] 11:34:04     INFO - PID 6604 | mRenderer: Unknown
[task 2023-10-11T11:34:04.988Z] 11:34:04     INFO - PID 6604 | mIsMesa: 0
[task 2023-10-11T11:34:05.290Z] 11:34:05     INFO - PID 6604 | [Parent 1220, GMPThread] WARNING: Failed to delete GMP storage directory: file /builds/worker/checkouts/gecko/dom/media/gmp/GMPServiceParent.cpp:1863
[task 2023-10-11T11:34:05.312Z] 11:34:05     INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948_tab_pid1252.log
[task 2023-10-11T11:34:26.128Z] 11:34:26     INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:26.148Z] 11:34:26     INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:26.149Z] 11:34:26     INFO - PID 6604 | [WARN  webrender::device::gl] Missing optimized shader source for gpu_cache_update
INFO - PID 6604 | [Parent 1220, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /builds/worker/workspace/obj-build/dist/include/mozilla/dom/SyncedContextInlines.h:95
[task 2023-10-11T11:34:26.260Z] 11:34:26     INFO - PID 6604 | [GPU 5872, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp:371
[task 2023-10-11T11:34:26.301Z] 11:34:26     INFO - PID 6604 | [Child 1252, Main Thread] WARNING: IPC message 'PContent::Msg_CleanupPendingLoadState' discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:545
[task 2023-10-11T11:34:26.400Z] 11:34:26     INFO - PID 6604 | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to C:\Users\task_169702309243806\AppData\Local\Temp\tmpdhk_rjjt\runtests_leaks_8948_tab_pid5432.log
[task 2023-10-11T11:34:39.194Z] 11:34:39     INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:39.196Z] 11:34:39     INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:39.258Z] 11:34:39     INFO - PID 6604 | [GPU 5872, COM MTA] WARNING: MFTEnumEx returned no IMFActivate instances: file /builds/worker/checkouts/gecko/dom/media/platforms/wmf/MFTDecoder.cpp:81
[task 2023-10-11T11:34:39.259Z] 11:34:39     INFO - PID 6604 | Assertion failure: mTreeManager, at /builds/worker/checkouts/gecko/gfx/layers/ipc/APZInputBridgeParent.cpp:35
[task 2023-10-11T11:34:39.780Z] 11:34:39     INFO - STDOUT: Initializing stack-fixing for the first stack frame, this may take a while...
[task 2023-10-11T11:34:51.352Z] 11:34:51     INFO - TEST-START | /acid/acid2/reftest.html
Severity: S4 → --
Component: web-platform-tests → Graphics
Priority: P5 → --
Product: Testing → Core
Summary: Intermittent Assertion failure: mTreeManager, at /builds/worker/checkouts/gecko/gfx/layers/ipc/APZInputBridgeParent.cpp:35 → Frequent beta Assertion failure: mTreeManager, at /builds/worker/checkouts/gecko/gfx/layers/ipc/APZInputBridgeParent.cpp:35

Bob, could you have a look over this one or redirect to someone else? Thank you.
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=437342206&repo=mozilla-beta&lineNumber=2668

Flags: needinfo?(bhood)

Despite the gfx in the path of the failing assert, it seems like this is probably more in apz land?

Flags: needinfo?(botond)
Flags: needinfo?(bhood)

RE [stockwell disable-recommended], I don't think there's any test we can disable here. :-/

This looks like it's a fatal assertion-failure that happens at startup, if I'm reading the logs correctly. Presumably there's some sort of race condition at startup that we're now on the losing side of, due to some change that was recently made that works out differently on Nightly vs. Beta, or something along those lines?

We also can't simply remove/downgrade the assertion itself, because if it fails, we're potentially going to crash with a null-deref a bit later. (It's asserting that a pointer is non-null, and then we boldly proceed to use that pointer in other methods in the same class). So unless the failing situation is for a one-off never-actually used object or something, then the assertion is about an invariant that looks to be important to avoid further crashes.

botond: I wonder if you've had a chance to look and have any ideas here... We probably don't want to leave beta (and probably release?) in this state indefinitely.

Per APZ/gfx sync meeting, I think this may be a failure specific to CI running on systems with a GPU, the datacenter GPUs like this one can have extremely different latency profiles because they are remote network devices (a whole rack dedicated to GPUs, separate from the regular CPU instances that talk to them), so there may be a race condition during startup where a call blocks when we don't expect it or vice versa, it may be worth forcing these tests to run in software just to see if it affects failure rate of this assert and determine if that tickled the bug, if so we may be able to modify the GPU process startup to accommodate something we didn't expect previously from D3D11 API behavior.

This feels similar to bug 1854633 and bug 1839589. These all seem to be happening on the same machines (usually win32 debug). The problem seems to be something early in the startup of the gpu process. I suspect they all have the same root cause. It seems that the infra changes around bug 1863630 might have triggered this.

See Also: → 1854633
See Also: → 1864255
See Also: → 1839589

Bug 1868357 seems similar, too (in that case, we failed some reftests due to unexpectedly using software-WR, after Killing GPU process due to IPC reply timeout early-on). Comment 18 here seems suspiciously relevant to what may have happened there.

See Also: → 1868357

Yes, the assertion failure seems to indicate something going wrong during GPU process startup.

The assertion failure happens in code running in the GPU process during its startup. The code is triggered by the parent process sending a PGPU::InitAPZInputBridge message in GPUProcessManager::CreateRemoteSession(). Note that the LayersId argument sent is aRootLayerTreeId, which is the LayersId representing the window's root layer tree (the one rendered by the parent process); a child LayersId (representing content rendered by a content process) is never sent.

The code in the GPU process looks up this root LayersId in the sIndirectLayerTrees lookup table. This entry should have been added in CompositorBridgeParent::Initialize(), triggered by this earlier line in CreateRemoteSession(); the entry has its mParent field populated at the same time. Finally, mParent->mApzcTreeManager is also populated in CompositorBridgeParent::Initialize(). (This last assignment is conditional on CompositorOptions::UseAPZ(), but that corresponds to the parent process check here, and as far as I can tell, we're propagating the same compositor options when creating the PCompositorBridge actors here.)

I couldn't thus far spot a concrete code path / set of conditions that would lead to mTreeManager accessed in this way being null, though the number of IPC protocols and messages involved here is large enough that doing an exhaustive analysis is a challenge.

The effect of a null mTreeManager (were it to occur in a non-debug build where the assertion failure isn't itself fatal) will almost certainly be the GPU process crashing shortly, as soon as the parent process sends it any input event. So, we could consider turning this assertion into something that causes GPU process startup to fail eagerly (i.e. return null from CreateRemoteSession). I'm not sure what the effect of this on affected CI runs would be -- maybe it would fix the failures because we'd try to re-start the GPU process, and whatever was causing the misbehaviour on the first startup doesn't happen a second time? Or maybe the GPU process crash would just show up as a failure with a different signature?

Flags: needinfo?(botond)

Does anyone have theories why the assertion happens only on beta branches? I think it would be another entry point where we can dig into details. I guess there should have some recent code changes behind nightly-only flags.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #22)

Does anyone have theories why the assertion happens only on beta branches? I think it would be another entry point where we can dig into details. I guess there should have some recent code changes behind nightly-only flags.

My guess is that other bugs listed in the see also of this bug have the same root cause because they all have issues with gpu process early on startup. Those bugs happen on mozilla-central. So if that guess is correct, the underlying issue is not branch-specific, but it manifests differently on beta.

(In reply to Botond Ballo [:botond] from comment #21)

The code in the GPU process looks up this root LayersId in the sIndirectLayerTrees lookup table. This entry should have been added in CompositorBridgeParent::Initialize(), triggered by this earlier line

The CompositorBridgeParent::Initialize() runs on the compositor thread. Whereas APZInputBridgeParent::APZInputBridgeParent runs on the main-thread. So if our IPC call processing is quite fast, then when the ctor of APZInputBridgeParent gets called, is it possible that CompositorBridgeParent::Initialize() hasn't yet finished?

If it's possible, just moving this lock at the top of the function would solve the race?

I pushed a try run with moving the lock, there was an assertion failure. So comment 25 isn't the case hitting the assertion. That means the root LayersId was removed from the sIndirectLayerTrees or the mParent got nullified for some reasons.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #25)

The CompositorBridgeParent::Initialize() runs on the compositor thread. Whereas APZInputBridgeParent::APZInputBridgeParent runs on the main-thread.

Thanks for pointing this out! I did overlook that they run on different threads.

However, Initialize is a synchronous message, so the particular issue you describe shouldn't occur.

See Also: → 1871960
See Also: → 1877322
Whiteboard: [stockwell disable-recommended]
See Also: 1864255
See Also: → 1721937
See Also: → 1896569
See Also: → 1897111
See Also: → 1899476
See Also: → 1899541
See Also: → 1897534
See Also: → 1904295
See Also: → 1904552
See Also: 1904552
See Also: → 1905236
See Also: → 1914795
See Also: → 1914800
You need to log in before you can comment on or make changes to this bug.