Closed Bug 1771931 Opened 2 years ago Closed 2 years ago

Crash in [@ OOM | large | NS_ABORT_OOM | nsTArray_base<T>::EnsureCapacity<T> | mozilla::a11y::CachedTableAccessible::EnsureRowCol]

Categories

(Core :: Disability Access APIs, defect)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
103 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox101 --- unaffected
firefox102 --- disabled
firefox103 --- fixed

People

(Reporter: mccr8, Assigned: Jamie)

References

Details

(Keywords: crash, regression, Whiteboard: [ctw-m1])

Crash Data

Attachments

(4 files)

Crash report: https://crash-stats.mozilla.org/report/index/d8cdd94e-30b9-4d8b-b3fc-b4b110220528

MOZ_CRASH Reason: MOZ_CRASH(OOM)

Top 10 frames of crashing thread:

0 libxul.so NS_ABORT_OOM xpcom/base/nsDebugImpl.cpp:678
1 libxul.so nsTArrayInfallibleAllocator::ResultTypeProxy nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>::EnsureCapacity<nsTArrayInfallibleAllocator> xpcom/ds/nsTArray-inl.h:167
2 libxul.so mozilla::a11y::CachedTableAccessible::EnsureRowCol accessible/base/CachedTableAccessible.cpp:219
3 libxul.so mozilla::a11y::CachedTableAccessible::GetFrom accessible/base/CachedTableAccessible.cpp:90
4 libxul.so mozilla::a11y::AccGroupInfo::TotalItemCount accessible/base/AccGroupInfo.cpp:229
5 libxul.so mozilla::a11y::RemoteAccessibleBase<mozilla::a11y::RemoteAccessible>::Attributes accessible/ipc/RemoteAccessibleBase.cpp:664
6 libxul.so mozilla::a11y::RemoteAccessible::Attributes accessible/ipc/other/RemoteAccessible.cpp:72
7 libxul.so mozilla::a11y::SessionAccessibility::PopulateNodeInfo accessible/android/SessionAccessibility.cpp:911
8 libxul.so void mozilla::jni::NativeStub<mozilla::java::SessionAccessibility::NativeProvider::GetNodeInfo_t, mozilla::a11y::SessionAccessibility, mozilla::jni::Args<int, mozilla::jni::Ref<mozilla::jni::Object, _jobject*> const&> >::Wrap<&mozilla::a11y::SessionAccessibility widget/android/jni/Natives.h:1447
9 base.odex base.odex@0x0000000000182088 

This looks like an Android-specific OOM. The OOM allocation size is exactly the same for all of these crashes (around 2.21GB) for all of these, which seems suspicious. The volume isn't super high, but I also imagine there can't be that many people using accessibility code on Android Nightly.

Jamie, I assume this could be a regression from some of your cache-the-world work, given the stack.

Flags: needinfo?(jteh)
Keywords: regression

I'm also looking at this content process crash, which is similar but slightly different:
bp-6bee2ea7-f6e0-4b61-a68e-a5e2f0220530
I'm hoping you can help me clarify some things here.
Does OOM Allocation Size refer to the amount of memory we just attempted to allocate that led to the OOM? So that would suggest the array (in the case of the parent process crash in comment 0) or the hash table (in the case of the content process crash I linked above) grew to be gigabytes in size? That's really puzzling to me, particularly the hash table case. Or is there another way to interpret this?

Flags: needinfo?(jteh) → needinfo?(continuation)
Blocks: a11y-ctw
Whiteboard: [ctw-m1]

I found a few places where things probably aren't being cleaned up properly, but nothing that would explain allocs this large:

  1. When we shut down a local DocAccessible, we call Shutdown on all its descendants, but we don't call UnbindFromParent or UncacheChildrenInSubtree. Those are the places where cached tables get invalidated. We'll need to explicitly invalidate cached tables in that case.
  2. Similarly, when we shut down a remote DocAccessibleParent, we destroy all the descendants without calling Shutdown. Shutdown is where cached tables get invalidated.

(In reply to James Teh [:Jamie] from comment #3)

  1. When we shut down a local DocAccessible, we call Shutdown on all its descendants, but we don't call UnbindFromParent

I'm wrong. LocalAccessible::Shutdown calls UnbindFromParent for all its children. It probably makes more sense to move this out of UnbindFromParent and into Shutdown, but regardless, it does get called.

  1. Similarly, when we shut down a remote DocAccessibleParent, we destroy all the descendants without calling Shutdown. Shutdown is where cached tables get invalidated.

I confirmed this is definitely a leak.

(In reply to James Teh [:Jamie] from comment #2)

I'm also looking at this content process crash, which is similar but slightly different:
bp-6bee2ea7-f6e0-4b61-a68e-a5e2f0220530

The crashes with that signature seem to all have the exact same OOM allocation size of about 1.04GB.

Does OOM Allocation Size refer to the amount of memory we just attempted to allocate that led to the OOM? So that would suggest the array (in the case of the parent process crash in comment 0) or the hash table (in the case of the content process crash I linked above) grew to be gigabytes in size? That's really puzzling to me, particularly the hash table case. Or is there another way to interpret this?

OOM allocation size is the value that was passed in to malloc.

The fact that all of the sizes are exactly the same suggests to me that this isn't some kind of data structure that is leaking and growing indefinitely, until finally we run out of memory. In that case, the OOM allocation sizes would likely be different, depending on how much memory the user has, what other programs are running, etc. Instead, each of these crashes has exactly one OOM allocation size. This suggests to me that something else is happening, like some data structure isn't being initialized properly, and then you are getting some kind of integer underflow. I don't know how that could happen, though, given that in both cases you are using standard Firefox data structures where you aren't even dealing with the internals.

The OOM allocation size for the hashtable crash is 0x3E000000 in hex. The OOM allocation size for the array crash is 0x839FFFFC. That doesn't really have any obvious cause to me, though.

At least for the hashtable, I don't see how you'd get a use-after-free, another source of possible issues.

Another possible way to corrupt data structures would be a race (though again I'd imagine that wouldn't end up ending precisely on the same value all of the times). Sometimes you can look through crash reports and get lucky and see another thread touching the same data structure as the crashing thread. I looked through about a dozen crash reports and found a couple that looked a bit suspicious.

bp-dc19738e-5fff-434e-b64d-873380220529 In this crash report, thread 17 (visible if you select "show other threads") looks to be waiting on a mutex inside DocAccessibleParent::RecvCache(). Is it possible that some of the deserialization code that ran before it touched the same CachedTableAccessible data structure without holding a lock?

bp-31aca5e7-5707-4ab4-a284-4961d0220529 In this crash report, thread 15 looks to be waiting on a mutex inside DocAccessibleParent::RecvShutdown(). In this case, there's no prior deserialization code, so I don't know how this could cause issues with a data structure on another thread.

I didn't find anything like that for the hashtable crash, so maybe this race idea is a big red herring.

Flags: needinfo?(continuation)

Nika, any ideas how we might end up with a precise large OOM allocation size across multiple crashes for PLDHashtable and nsTArray? I guess it could just be coincidence that with the growth strategy this is always the point where it gets to be huge, given the constraints of Android, but it seems odd to me.

Flags: needinfo?(nika)

nsTArray's allocation sizes do take on very predictable intervals, so given that these tables appear to be allocated without doing any explicit SetCapacity calls, I would expect them to follow the same pattern.

Based on the specific threshold of approximately 2Gb, I'm expecting that these OOMs are actually caused by the IsTwiceTheSizeRequriedBytesRepresentableAsUint32 assertion here: https://searchfox.org/mozilla-central/rev/3419858c997f422e3e70020a46baae7f0ec6dacc/xpcom/ds/nsTArray-inl.h#161-169, which lines up with the call-site in the crash report. This error occurs if code tries to allocate over a 2Gb buffer, and aborts early to avoid needing to do overflow handling in the buffer growth algorithm (as well as because the size field of a nsTArray is only 32-bits, and a larger size may not fit).

My guess is that these sizes are the first in the normal sequence of allocation growths which are large enough that they fail that check, leading to an OOM whether or not the actual system has enough memory for the allocation to be successful. This makes some amount of sense, because I'd expect that OOMs on android are probably usually handled by overcommiting and then killing the browser when we run out of memory, rather than us crashing, most of the time.

In terms of what is causing us to grow the table so large, I'm specifically quite suspicious of EnsureRow and EnsureRowCol. These methods take in unsigned integers, and call AppendElement in a loop to grow the array to be large enough that array[a{Row,Col}Index] is valid. Because they don't call SetCapacity or AppendElements, this means we'll see normal growth patterns as to the array it looks like each element is being added individually. My vague guess is that some math around rows & columns underflowed such that a{Row,Col}Index is a very large number, and so we got stuck in a loop appending 2 billion individual entries to the array before we hit the error condition and crashed.

In addition to fixing the potential underflow issue, the code should probably be changed to use SetCapacity to grow ahead of time. EnsureRow could also be changed to use mRowColToCellIdx.AppendElements(aRowIndex - mRowColToCellIdx.Length()), though the default value for cells appears to be wrong.

Flags: needinfo?(nika)

Thanks. Yeah, I suspected an underflow too, but I just don't see how we could be underflowing. One way that could happen is if we got a col/row span of 0, but we get that info from layout and layout is never supposed to return 0 there. I guess I'll add an assert and a safety check and see what happens.

That still doesn't explain the hash table crash, though.

When a DocAccessibleParent is destroyed, we don't call Shutdown on its RemoteAccessibles to avoid pointless cleanup overhead.
Previously, this meant we weren't cleaning up associated CachedTableAccessibles.
We now explicitly clean these up in DocAccessibleParent::Destroy.

Assignee: nobody → jteh
Status: NEW → ASSIGNED

We're seeing crashes in the wild where we're trying to allocate a very large number of array elements.
This suggests that the column index is underflowing.
The only way I can think of that this could happen is if colspan is 0 on the first cell, in which case we'd do 0 + 0 - 1 and underflow.
That shouldn't be possible - layout shouldn't ever give us a row/colspanof 0 - but perhaps it's happening anyway.
If this happens, we'll now gracefully treat it as a span of 1.
I also added an assertion in case this helps us to track this down properly in future.

Rather than allocating elements one by one, bulk allocate as many as we know for sure we'll need.

First, move the cached table cleanup code out of LocalAccessible::UnbindFromParent and into Shutdown.
This makes no practical difference - it gets called either way when the document shuts down - but it's a bit clearer what's happening this way.

Second, add an assertion to CachedTableAccessible::GetFrom to ensure it is only ever given a table.
I don't think it would ever be given anything else, but this makes that clearer.

Pushed by jteh@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9ae193d1f988
part 1: Clean up CachedTableAccessibles when a DocAccessibleParent is destroyed. r=morgan
https://hg.mozilla.org/integration/autoland/rev/bbec368671f1
part 2: Gracefully handle row/colspan 0 for CachedTableAccessible. r=morgan
https://hg.mozilla.org/integration/autoland/rev/40fc5890be03
part 3: Pre-allocate row/col arrays in CachedTableAccessible. r=morgan
https://hg.mozilla.org/integration/autoland/rev/60b3d18b1d40
part 4: Clarify some code regarding CachedTableAccessible creation and cleanup. r=morgan
See Also: → 1772835

The patch landed in nightly and beta is affected.
:Jamie, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jteh)

CTW is now disabled on 102, plus this doesn't seem to quite fix the crashes anyway.

Flags: needinfo?(jteh)

Copying crash signatures from duplicate bugs.

Crash Signature: [@ OOM | large | NS_ABORT_OOM | nsTArray_base<T>::EnsureCapacity<T> | mozilla::a11y::CachedTableAccessible::EnsureRowCol] → [@ OOM | large | NS_ABORT_OOM | nsTArray_base<T>::EnsureCapacity<T> | mozilla::a11y::CachedTableAccessible::EnsureRowCol] [@ OOM | large | NS_ABORT_OOM | nsTArrayInfallibleAllocator::SizeTooBig | nsTArray_base<T>::EnsureCapacity<T> | mozilla::a11y::Cached…

The crashes do seem to have stopped with this patch; bug 1772835 turned out to be crashes from before the patch landed.

Crash Signature: [@ OOM | large | NS_ABORT_OOM | nsTArray_base<T>::EnsureCapacity<T> | mozilla::a11y::CachedTableAccessible::EnsureRowCol] [@ OOM | large | NS_ABORT_OOM | nsTArrayInfallibleAllocator::SizeTooBig | nsTArray_base<T>::EnsureCapacity<T> | → [@ OOM | large | NS_ABORT_OOM | nsTArray_base<T>::EnsureCapacity<T> | mozilla::a11y::CachedTableAccessible::EnsureRowCol] [@ OOM | large | NS_ABORT_OOM | nsTArrayInfallibleAllocator::SizeTooBig | nsTArray_base<T>::EnsureCapacity<T> |
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: