Open Bug 1749109 Opened 2 years ago Updated 5 days ago

Assertion failure: result.mScrollId == ScrollableLayerGuid::NULL_SCROLL_ID, at /builds/worker/checkouts/gecko/gfx/layers/apz/src/WRHitTester.cpp:75

Categories

(Core :: Panning and Zooming, defect, P3)

defect

Tracking

()

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: assertion, leave-open)

Attachments

(1 file)

Filed by: malexandru [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=363462339&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/FqPOHRmoTNeTFV_cTPmpLg/runs/0/artifacts/public/logs/live_backing.log


[task 2022-01-08T06:42:54.605Z] 06:42:54     INFO - GECKO(6740) | [Child 7332, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, false) failed with result 0x804B0050 (NS_ERROR_INSUFFICIENT_DOMAIN_LEVELS): file /builds/worker/checkouts/gecko/toolkit/components/antitracking/URLQueryStringStripper.cpp:129
[task 2022-01-08T06:42:54.628Z] 06:42:54     INFO - GECKO(6740) | [Parent 8160, Main Thread] WARNING: WebProgress Ignored: BrowsingContext is null or discarded: file /builds/worker/checkouts/gecko/dom/ipc/BrowserParent.cpp:2961
[task 2022-01-08T06:42:54.769Z] 06:42:54     INFO - GECKO(6740) | [Parent 8160, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /builds/worker/workspace/obj-build/dist/include/mozilla/dom/SyncedContextInlines.h:94
[task 2022-01-08T06:42:54.794Z] 06:42:54     INFO - GECKO(6740) | [GPU 7252, Compositor] WARNING: Possibly dropping task posted to updater thread: file /builds/worker/checkouts/gecko/gfx/layers/apz/src/APZUpdater.cpp:362
[task 2022-01-08T06:42:54.796Z] 06:42:54     INFO - GECKO(6740) | Assertion failure: result.mScrollId == ScrollableLayerGuid::NULL_SCROLL_ID, at /builds/worker/checkouts/gecko/gfx/layers/apz/src/WRHitTester.cpp:75
[task 2022-01-08T06:42:54.802Z] 06:42:54     INFO - GECKO(6740) | [Child 4196, Main Thread] WARNING: '!wgc', file /builds/worker/checkouts/gecko/dom/ipc/jsactor/JSWindowActorProtocol.cpp:260
[task 2022-01-08T06:42:54.894Z] 06:42:54     INFO - GECKO(6740) | #01: mozilla::layers::WRHitTester::GetAPZCAtPoint(mozilla::gfx::PointTyped<mozilla::ScreenPixel,float> const&, mozilla::RecursiveMutexAutoLock const&) [gfx/layers/apz/src/WRHitTester.cpp:75]
[task 2022-01-08T06:42:54.895Z] 06:42:54     INFO - GECKO(6740) | #02: mozilla::layers::APZCTreeManager::ProcessUnhandledEvent(mozilla::gfx::IntPointTyped<mozilla::LayoutDevicePixel>*, mozilla::layers::ScrollableLayerGuid*, unsigned long long*, mozilla::layers::LayersId*) [gfx/layers/apz/src/APZCTreeManager.cpp:2301]
[task 2022-01-08T06:42:54.896Z] 06:42:54     INFO - GECKO(6740) | #03: mozilla::layers::APZInputBridgeParent::RecvProcessUnhandledEvent(mozilla::gfx::IntPointTyped<mozilla::LayoutDevicePixel> const&, mozilla::gfx::IntPointTyped<mozilla::LayoutDevicePixel>*, mozilla::layers::ScrollableLayerGuid*, unsigned long long*, mozilla::layers::LayersId*) [gfx/layers/ipc/APZInputBridgeParent.cpp:132]
[task 2022-01-08T06:42:54.897Z] 06:42:54     INFO - GECKO(6740) | #04: mozilla::layers::PAPZInputBridgeParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) [s3:gecko-generated-sources:c0fa2b6b8527089947076b30eefa0962a7a763b7764862078d40e0e3d1a345257b8de100d3a29f9ce56287c64662df4d913333ed4ab0f50fccfcf1527803a60d/ipc/ipdl/PAPZInputBridgeParent.cpp::565]
[task 2022-01-08T06:42:54.898Z] 06:42:54     INFO - GECKO(6740) | #05: mozilla::ipc::MessageChannel::DispatchSyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&, IPC::Message*&) [ipc/glue/MessageChannel.cpp:2006]
[task 2022-01-08T06:42:54.898Z] 06:42:54     INFO - GECKO(6740) | #06: mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) [ipc/glue/MessageChannel.cpp:1960]
[task 2022-01-08T06:42:54.899Z] 06:42:54     INFO - GECKO(6740) | #07: mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) [ipc/glue/MessageChannel.cpp:1824]
[task 2022-01-08T06:42:54.899Z] 06:42:54     INFO - GECKO(6740) | #08: mozilla::ipc::MessageChannel::MessageTask::Run() [ipc/glue/MessageChannel.cpp:1853]
[task 2022-01-08T06:42:54.900Z] 06:42:54     INFO - GECKO(6740) | #09: mozilla::RunnableTask::Run() [xpcom/threads/TaskController.cpp:469]
[task 2022-01-08T06:42:54.901Z] 06:42:54     INFO - GECKO(6740) | #10: mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex &> const&) [xpcom/threads/TaskController.cpp:771]
[task 2022-01-08T06:42:54.901Z] 06:42:54     INFO - GECKO(6740) | #11: mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex &> const&) [xpcom/threads/TaskController.cpp:607]
[task 2022-01-08T06:42:54.902Z] 06:42:54     INFO - GECKO(6740) | #12: mozilla::TaskController::ProcessPendingMTTask(bool) [xpcom/threads/TaskController.cpp:391]
[task 2022-01-08T06:42:54.902Z] 06:42:54     INFO - GECKO(6740) | #13: mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:124:7'>::Run() [xpcom/threads/nsThreadUtils.h:532]
[task 2022-01-08T06:42:54.903Z] 06:42:54     INFO - GECKO(6740) | #14: nsThread::ProcessNextEvent(bool, bool*) [xpcom/threads/nsThread.cpp:1199]
[task 2022-01-08T06:42:54.903Z] 06:42:54     INFO - GECKO(6740) | #15: NS_ProcessNextEvent(nsIThread*, bool) [xpcom/threads/nsThreadUtils.cpp:467]
[task 2022-01-08T06:42:54.904Z] 06:42:54     INFO - GECKO(6740) | #16: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:85]
[task 2022-01-08T06:42:54.904Z] 06:42:54     INFO - GECKO(6740) | #17: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:325]
[task 2022-01-08T06:42:54.905Z] 06:42:54     INFO - GECKO(6740) | #18: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:307]
[task 2022-01-08T06:42:54.905Z] 06:42:54     INFO - GECKO(6740) | #19: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:139]
[task 2022-01-08T06:42:54.905Z] 06:42:54     INFO - GECKO(6740) | #20: nsAppShell::Run() [widget/windows/nsAppShell.cpp:603]
[task 2022-01-08T06:42:54.906Z] 06:42:54     INFO - GECKO(6740) | #21: XRE_RunAppShell() [toolkit/xre/nsEmbedFunctions.cpp:864]
[task 2022-01-08T06:42:54.906Z] 06:42:54     INFO - GECKO(6740) | #22: mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:235]
[task 2022-01-08T06:42:54.906Z] 06:42:54     INFO - GECKO(6740) | #23: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:325]
[task 2022-01-08T06:42:54.907Z] 06:42:54     INFO - GECKO(6740) | #24: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:307]
[task 2022-01-08T06:42:54.907Z] 06:42:54     INFO - GECKO(6740) | #25: XRE_InitChildProcess(int, char**, XREChildData const*) [toolkit/xre/nsEmbedFunctions.cpp:705]
[task 2022-01-08T06:42:54.907Z] 06:42:54     INFO - GECKO(6740) | #26: NS_internal_main(int, char**, char**) [browser/app/nsBrowserApp.cpp:327]
[task 2022-01-08T06:42:54.908Z] 06:42:54     INFO - GECKO(6740) | #27: wmain(int, wchar_t**) [toolkit/xre/nsWindowsWMain.cpp:147]
[task 2022-01-08T06:42:54.908Z] 06:42:54     INFO - GECKO(6740) | #28: __scrt_common_main_seh() [/builds/worker/workspace/obj-build/browser/app/d:/agent/_work/1/s/src/vctools/crt/vcstartup/src/startup/exe_common.inl:288]
[task 2022-01-08T06:42:54.909Z] 06:42:54     INFO - GECKO(6740) | #29: BaseThreadInitThunk [C:\Windows\System32\KERNEL32.DLL + 0x17034]
[task 2022-01-08T06:42:54.909Z] 06:42:54     INFO - GECKO(6740) | #30: RtlUserThreadStart [C:\Windows\SYSTEM32\ntdll.dll + 0x52651]
[task 2022-01-08T06:42:55.522Z] 06:42:55     INFO - GECKO(6740) | [Parent 8160, IPC I/O Parent] WARNING: [1.1]: Ignoring message 'EVENT_MESSAGE' to unknown peer 766295460326B3CC.4BDEAEF9757F4304: file /builds/worker/checkouts/gecko/ipc/glue/NodeController.cpp:341
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

I can get this assert to trigger on my debug build pretty reliably:

  1. Go to about:config
  2. Bring up the print dialog
  3. Wait - triggering isn't instant, but it seems to trigger for me within 10 seconds. Otherwise, clicking the area outside of the print dialog seems to be another reliable trigger.

I just hit this locally as well. In my case, I had also visited about:config and did Ctrl+P, and I had an "always on top" terminal-window floating in the foreground, and I think I was trying to mousewheel scroll on the print-preview section of the print dialog.

Not sure if values are useful, but FWIW in my case result.mScrollId was 7, whereas ScrollableLayerGuid::NULL_SCROLL_ID is 0.

There's a long comment before the assertion saying that it's sort-of-expected to fail in non-fission scenarios, and there's a non-fatal version of the assertion for those scenarios (from bug 1634763). I wonder if my & David's situation with about:config (which importantly is rendered in the parent process) is functionally equivalent to the "non-fission" scenario, and we need to adjust the fatal assertion to account for that?

botond, maybe you could take a look here -- looks like you're the last to touch the code in question, according to searchfox.

Flags: needinfo?(botond)

I captured this in rr and submitted it to pernosco, using the STR from comment 21 (with me mousewheel scrolling over the print dialog, and with an always-on-top terminal window floating at the side of my display; again, I'm not entirely sure if those are necessary or not).

In comment 24 and in my rr recording, I was using a debug+opt build, i.e. --enable-debug --enable-optimize=O1. I also tried with a --disable-optimize build, for more-reliable variable visibility, but it turned out I always hit a different fatal-assertion first. :)

(In reply to Daniel Holbert [:dholbert] from comment #24)

I wonder if my & David's situation with about:config (which importantly is rendered in the parent process) is functionally equivalent to the "non-fission" scenario, and we need to adjust the fatal assertion to account for that?

I don't think so -- in order for us to be in the "expected to fail" case, this function would have to return false, but with our default pref values that only happens if FissionAutostart() returns false, and that should not happen even in the parent process (and the Pernosco recording seems to confirm this).

We have a similar bug open, bug 1690819, which may have the same underlying cause.

These bugs tend to be tricky to investigate, because they basically indicate a discrepancy between two different data structures in different places (data structures derived from the WebRender display list in the Rust part of the compositor, and data structures derived from the WebRenderScrollData in C++ APZ code), that are both ultimately based on the Gecko display list on the main thread, but there are various processing stages where there is a possibility of the discrepancy being introduced.

I will seek to prioritize investigation of bug 1690819 + this bug together early next year. The Pernosco recording should be helpful, thanks for that!

Flags: needinfo?(botond)
See Also: → 1690819

Quick update on this based on a recent discussion in our APZ planning meeting:

  • Timothy has a theory for what might be causing these assertions, related to there being two subtly different notions of an "active layer" in the Gecko display list.
  • We're planning to do at least enough investigation to confirm this, but if the theory is correct, fixing that underlying problem is likely to be lower priority because it's somewhat of a tricky fix to get right (e.g. likely to require combing through history so we can be confident we're not regressing something else), and any correctness impact from the discrepancy is likely to be limited to scrollbar dragging being affected in infrequent edge cases.

Specifically I remember debugging an instance of and it looked like the cause was this call to IsScrollingActive when adding scrollbar parts

https://searchfox.org/mozilla-central/rev/72d9bf2d4128990cd1f349d4e15003e515277982/layout/generic/nsGfxScrollFrame.cpp#3600

IsScrollingActive is only used in two places in the tree anymore, it would be good to get rid of it. Usually when we say a scrollframe is active we mean that is has a displayport and so can be async scrolled on the compositor.

Status: REOPENED → RESOLVED
Closed: 2 years ago1 year ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

If this helps, I've noticed this bug reproduces consistently on Thunderbird's debug builds, when the mouse cursor moves to an edge of the print preview modal (if the email being printed fits within a single page).

I've attached a patch for comm-central with a test that reproduces this crash, and here is an example of it failing on try-comm-central.

Assignee: nobody → brendan
Assignee: brendan → nobody
Comment on attachment 9348472 [details] [diff] [review]
print_preview_crash_test.patch

[I tried to edit the patch in place, which didn't work but this wasn't an important change anyway, so please ignore it]
Status: REOPENED → RESOLVED
Closed: 1 year ago6 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 months ago4 months ago
Resolution: --- → INCOMPLETE

I happened to capture this again in pernosco just now, in a debug build from today's mozilla-central. (I was trying to reproduce a different bug and tripped over this one instead.) --> Reopening.

Pernosco trace: https://pernos.co/debug/toE_lWri_mPbR_DPUjStQg/index.html

Before hitting the crash, I was interacting with about:home, possibly having just navigated to or back-from the YouTube shortcut-tile that appears on that page.

tnikkel/botond, it looks like you investigated this a year ago (comment 30 - 31), and it sounds like we were thinking that the underlying issue isn't particularly worrisome and not a priority to fix. If that still seems true, could we perhaps relax this assertion to be nonfatal (i.e. NS_ASSERTION) so that it doesn't continue to interrupt debugging sessions that are trying to capture unrelated issues?

Status: RESOLVED → REOPENED
Flags: needinfo?(botond)
Resolution: INCOMPLETE → ---
Flags: needinfo?(tnikkel)

(In reply to Daniel Holbert [:dholbert] from comment #63)

tnikkel/botond, it looks like you investigated this a year ago (comment 30 - 31), and it sounds like we were thinking that the underlying issue isn't particularly worrisome and not a priority to fix. If that still seems true, could we perhaps relax this assertion to be nonfatal (i.e. NS_ASSERTION) so that it doesn't continue to interrupt debugging sessions that are trying to capture unrelated issues?

As discussed in comment 29, this assertion is useful for catching discrepancies between the "Gecko display list --> WebRenderScrollData --> APZ HitTestingTree" pipeline and the "Gecko display list --> WebRender display list --> WebRender scene" pipeline; since the compositor hit test operates on a combination of the APZ HitTestingTree and the WebRender scene, it's important that the two remain consistent.

I believe what we've found in comments 30-31 is that an instance of the assertion that Timothy debugged was related to IsScrollingActive, and that the correctness impact of this is likely to be low ("limited to scrollbar dragging being affected in infrequent edge cases").

However, the assertion has the potential to catch other discrepancies between these two pipelines, including new ones that might be introduced as a regression. Particularly with larger display list refactorings such as bug 1730749 and FFXP-2497 on the horizon, which increase the risk of introducing such inconsistencies, I'd prefer to keep the assertion around for its regression-catching potential.

So, if we're finding that the existing instances of the assertion are interfering with other development tasks, I would lean towards trying to get investigating and fixing that underlying IsScrollingActive issue back on the agenda, rather than downgrading the assertion.

I'm curious to hear Timothy's thoughts as well.

Flags: needinfo?(botond)

Thanks! That all makes sense.

I'm not sure how easy this is to trigger (I just tripped over it this one time recently by chance); but hopefully the pernosco trace in comment 63 is actionable enough to at least figure out what was going on there, if one of you has cycles to take a look.

I would concur with Botond. The reason I was active in this bug was related the scrollbars and IsScrollingActive is cannot be a cause of this assert when fission is enabled, so you were likely seeing something else, and it does seem important to note when hitting those kind of cases.

Flags: needinfo?(tnikkel)

Bumping severity. We're tracking this as part of a group of APZ stability issues we're hoping to fix in H1 (FFXP-2132).

Severity: S4 → S3
Priority: P5 → P3
Status: REOPENED → RESOLVED
Closed: 4 months ago2 months ago
Resolution: --- → INCOMPLETE

I'm going to re-open this. While the issue may only occur very infrequently in CI, developers do occasionally run into this when running debug builds locally, and the posted Pernosco traces (comment 26, comment 63) should make this actionable. It's on our roadmap to investigate this as mentioned in comment 69.

(Setting a combination of flags that will hopefully prevent BugBot from closing it again.)

Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Summary: Intermittent Assertion failure: result.mScrollId == ScrollableLayerGuid::NULL_SCROLL_ID, at /builds/worker/checkouts/gecko/gfx/layers/apz/src/WRHitTester.cpp:75 → Assertion failure: result.mScrollId == ScrollableLayerGuid::NULL_SCROLL_ID, at /builds/worker/checkouts/gecko/gfx/layers/apz/src/WRHitTester.cpp:75
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: