Closed Bug 1801879 Opened 1 year ago Closed 1 year ago

Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>], inside of mozilla::a11y::nsAccUtils::GetSelectableContainer on Android (with an a11y service enabled)

Categories

(Core :: Disability Access APIs, defect, P1)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox107 --- unaffected
firefox108 + fixed
firefox109 --- fixed

People

(Reporter: dholbert, Assigned: Jamie)

References

(Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(2 files, 1 obsolete file)

I'm hitting this crash reliably on Fenix nightly. STR coming up.

Crash report: https://crash-stats.mozilla.org/report/index/55666e6e-9c9e-4938-ad64-c34e40221122

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(data) (Cycle collected object used on a thread without a cycle collector.)

Top 10 frames of crashing thread:

0  libxul.so  NS_CycleCollectorSuspect3  xpcom/base/nsCycleCollector.cpp:3797
1  libxul.so  nsCycleCollectingAutoRefCnt::incr<&NS_CycleCollectorSuspect3>  xpcom/base/nsISupportsImpl.h:248
1  libxul.so  nsCycleCollectingAutoRefCnt::incr<&NS_CycleCollectorSuspect3>  xpcom/base/nsISupportsImpl.h:234
1  libxul.so  nsDocLoader::AddRef  uriloader/base/nsDocLoader.cpp:193
1  libxul.so  nsDocShell::AddRef  docshell/base/nsDocShell.cpp:587
1  libxul.so  {virtual override thunk}  docshell/base/nsDocShell.cpp
2  libxul.so  nsCOMPtr<nsIDocShell>::nsCOMPtr  xpcom/base/nsCOMPtr.h:525
2  libxul.so  nsCoreUtils::GetDocShellFor  accessible/base/nsCoreUtils.cpp:330
3  libxul.so  mozilla::a11y::DocAccessible::NativeRole const  accessible/generic/DocAccessible.cpp:199
4  libxul.so  mozilla::a11y::LocalAccessible::Role const  accessible/generic/LocalAccessible-inl.h:26

STR:

  1. Have an app installed with an a11y service (BitWarden in my case, with BitWarden settings | Auto-fill services | Use Accessibility set to enabled)
  2. Start Firefox Nightly and go to https://crash-stats.mozilla.org/signature/?signature=mozilla%3A%3AFrameProperties%3A%3APropertyComparator%3A%3AEquals&date=%3E%3D2022-11-15T15%3A01%3A00.000Z&date=%3C2022-11-22T15%3A01%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=1
  3. Click the "Reports" tab and click in the textbox that lets you pick which fields you'd like to use
  4. Type "comm" (or "a" probably or any character, really)

ACTUAL RESULTS: Firefox crashes
EXPECTED RESULTS: No crash

I can try to get a regression range later today.

Flags: needinfo?(dholbert)

Hmm, not sure I'll be able to get a regression range, actually. GVE doesn't seem to repro the bug, unfortunately. And mozregression only seems to support gve, not "real" Firefox Nightly for Android.

But I tested in Firefox Nightly on a separate device (a Pixel 4a, on a fresh Android 13 install, with a freshly installed / fresh-profile Firefox Nightly) and it reproduces reliably there.

Flags: needinfo?(dholbert)

FWIW the assertion was added in bug 1776243, in June of this year. Adding a dependency on that bug. But crash data suggests that this started being a problem more-recently than that; the first crash we have here was on Sept 15.

Depends on: 1776243

My patch just made it so we crashed with a nice message. Before that, we just crashed with a null deref in a way that was hard to understand. I think this is more likely a regression from recenty a11y Android work. I think I've seen some bugs on file but I don't have a bug number handy.

Our oldest crash from Sept 15 was bp-21cc0c5c-1e2c-4550-9c8c-50c130220915 and seems to have been built from http://hg.mozilla.org/mozilla-central/rev/e2ce8d3d4a4b01bf3ec52fb795c5f8c392c1476b (based on the source links in its backtrace).

Here's the pushlog for 3 days leading up to that commit, as a tentative regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2022-09-11+16%3A00%3A00&enddate=2022-09-14+16%3A00%3A00

(Probably it'd be something in there, since it looks like we've had a few crashes every day since this started happening, so it's likely that the guilty patch landed within the a day or so prior to the first crash report.)

Shot in the dark, looking at that range: maybe this was from bug 1788143, which was related to a11y and comboboxes? The actual component here isn't a combo-box, but it seems like it might be serving as one for a11y purposes; the entries in the autocomplete-dropdown haev role="option", e.g. they look like
<div class="select2-result-label" id="select2-result-label-860" role="option"><span class="select2-match"></span>abort message</div>

Nathan or Jamie, maybe you have ideas here.

Flags: needinfo?(jteh)

(The only other a11y change I'm seeing in my tentative 3-day regression-range pushlog is bug 1788547, but that was just adding a null-check which seems unlikely to introduce a crash here.

I'm not seeing any Android or thread-related stuff there that feels like it could have a connection, but it's entirely possible I'm overlooking something.)

(In reply to Andrew McCreight [:mccr8] from comment #5)

I think this is more likely a regression from recenty a11y Android work.

Possibly related to bug 1765433 (and other cache-the-world a11y android work). That's what the hg-blame for the topmost a11y functions in the backtrace are pointing to, at least -- SessionAccessibility::GetNodeInfo, SessionAccessibility::PopulateNodeInfo. (Though that landed a while back, so there must've been some more recent triggering event as well.)

--> Adding eeejay to ni. Also actually-CC'ing Nathan who I meant to CC in comment 6. :)

Depends on: 1765433
Flags: needinfo?(eitan)
Attached video screencast of bug

Here's a screencast of the bug reproducing, in case it's useful. (starting out showing the options that I have enabled in Bitwarden.)

Summary: Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>] → Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>], inside of mozilla::a11y::SessionAccessibility::GetNodeInfo on Android (with an a11y service enabled)

Side note: some of these crashes (e.g. the very first one bp-21cc0c5c-1e2c-4550-9c8c-50c130220915) are in SessionAccessibility::CachedPivot, but my crash linked from comment 0 is in SessionAccessibility::GetNodeInfo.

Bug 1765433 was about both sorts of methods, so I suspect it's all related. (That bug's title was Make GetNodeInfo and Pivot methods run safely in Android UI thread)

Also, these crash reports are on the AndroidUI thread; the crash reports say "Crashing Thread (0), Name: AndroidUI". So this does indeed feel pretty closely related to bug 1765433, despite the time delay; and that means hopefully eeejay is the best person to point at this. So I'll clear my earlier ni=jamie to avoid needinfo inflation. :)

Flags: needinfo?(jteh)
Summary: Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>], inside of mozilla::a11y::SessionAccessibility::GetNodeInfo on Android (with an a11y service enabled) → Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>], inside of SessionAccessibility::GetNodeInfo or SessionAccessibility::CachedPivot, on Android (with an a11y service enabled)
Flags: needinfo?(eitan)
Keywords: regression
Regressed by: 1796733

(In reply to Daniel Holbert [:dholbert] from comment #3)

Hmm, not sure I'll be able to get a regression range, actually. GVE doesn't seem to repro the bug, unfortunately. And mozregression only seems to support gve, not "real" Firefox Nightly for Android.

Turns out Fenix nightlies are available for manual download at https://archive.mozilla.org/pub/fenix/nightly/2022/ (that's the best we've got for Fenix bisecting). Using those with my STR in comment 1, I got this range:
Last good:
https://archive.mozilla.org/pub/fenix/nightly/2022/11/2022-11-10-05-01-35-fenix-108.0b1-android-arm64-v8a/
First bad:
https://archive.mozilla.org/pub/fenix/nightly/2022/11/2022-11-10-17-01-19-fenix-108.0b1-android-arm64-v8a/
Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8baf918fe8e882b3826ddaef4b204f4c1ec63563&tochange=c49bf65cd85aa4925e2851f97abce61b071cf704

(This is much more recent than when we started getting crash data, so I suspect this pushlog contains a commit that just mad this bug easier to repro, or possible to repro using my particular scenario in commnet 1 at least.)

(In reply to Daniel Holbert [:dholbert] from comment #10)

Side note: some of these crashes (e.g. the very first one bp-21cc0c5c-1e2c-4550-9c8c-50c130220915) are in SessionAccessibility::CachedPivot, but my crash linked from comment 0 is in SessionAccessibility::GetNodeInfo.

Your crash linked in comment 0 is caused by bug 1796733, so I'm using this bug to deal with that. There are various other bugs dealing with other instances of this crash. I don't think we have a bug on file for the CachedPivot case, though.

Bug 1765433 was about both sorts of methods, so I suspect it's all related.

It's all related, yes, but the fixes are unfortunately all very different.

Gotcha. Good news: my fenix nightly manual-download-and-test regression range (in comment 11) agrees with you about bug 1796733 being the culprit here.

Thanks for taking a look!

Set release status flags based on info from the regressing bug 1796733

Severity: -- → S2
Priority: -- → P1
Summary: Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>], inside of SessionAccessibility::GetNodeInfo or SessionAccessibility::CachedPivot, on Android (with an a11y service enabled) → Crash in [@ NS_CycleCollectorSuspect3 | nsCycleCollectingAutoRefCnt::incr<T>], inside of mozilla::a11y::nsAccUtils::GetSelectableContainer on Android (with an a11y service enabled)

Filed bug 1801986 for the CachedPivot case.

Blocks: a11y-ctw
See Also: → 1801986

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

The fix is trivial. Getting an Android build environment and emulator set up and working and then writing a test to verify the fix is... very non-trivial.

Assignee: nobody → jteh

Trying to access a local OuterDocAccessible from the Android UI thread was causing a crash.
We shouldn't be crossing document boundaries anyway.

This avoids situations in Android where we get parent process local
accessibles off the main thread.

Lol. I didn't realize this was in my review queue before I wrote the exact same fix..

Attachment #9305710 - Attachment is obsolete: true
Pushed by jteh@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/57434a88c266
Don't cross document boundaries in nsAccUtils::GetSelectableContainer. r=eeejay,geckoview-reviewers,owlish

Once this lands in mozilla-central, did you want to nominate this for uplift for fx108?

Flags: needinfo?(jteh)
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

Comment on attachment 9305238 [details]
Bug 1801879: Don't cross document boundaries in nsAccUtils::GetSelectableContainer.

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes on Android.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Clear logic to restrict a tree walk covered by automated tests.
  • String changes made/needed:
  • Is Android affected?: Yes
Flags: needinfo?(jteh)
Attachment #9305238 - Flags: approval-mozilla-beta?

Comment on attachment 9305238 [details]
Bug 1801879: Don't cross document boundaries in nsAccUtils::GetSelectableContainer.

Approved for 108.0b9

Attachment #9305238 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: