Closed Bug 1336971 Opened 8 years ago Closed 7 years ago

Crash in InvalidArrayIndex_CRASH | nsTArray_Impl<T>::operator[] | mozilla::a11y::AccessibleWrap::GetRemoteIAccessibleFor

Categories

(Core :: Disability Access APIs, defect, P2)

x86
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- disabled
firefox56 --- disabled
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: MarcoZ, Assigned: bugzilla)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [AV:Webroot SecureAnywhere][aes+])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-f319733b-ba0b-42c9-a426-191dd2170206.
=============================================================
I got this crash after Nightly froze again while I was working in a tab. This time, I was actually waiting for Jet Airways to process a booking update. Nightly froze, and I was forced to right-click its taskbar icon and select Close Window. Crash happened immediately afterwards. Leaving this here FYI, and perhaps you see something significant in the stack.

The site works fine with E10S off.
So, The only way I can see an out of bounds array access there is for the array to be mutated while we iterate over it, which really shouldn't be happening.  The only way I can see for it to happen is that while GetProxyAccessibleInSubtree() is makeing remote com calls to the content process the parent process handles IPC messages.  Aron does that seem like something that can be happening?
Flags: needinfo?(aklotz)
Flags: needinfo?(aklotz)
Whiteboard: aes+
(In reply to Trevor Saunders (:tbsaunde) from comment #1)
> So, The only way I can see an out of bounds array access there is for the
> array to be mutated while we iterate over it, which really shouldn't be
> happening.  The only way I can see for it to happen is that while
> GetProxyAccessibleInSubtree() is makeing remote com calls to the content
> process the parent process handles IPC messages.  Aron does that seem like
> something that can be happening?

I wouldn't expect that to happen given that the parent main thread is blocked on a remote COM call. ie, it shouldn't be able to dispatch any incoming IPDL while waiting for a reply from COM.

OTOH, since the main thread is in the single-threaded apartment, COM is continuously pumping messages while waiting for that reply. If we handle a message in such a way that we inadvertently trigger a nested event loop that ends up dispatching IPDL, then I guess it could happen.
really low volume crash, removing from our tracking list.
Whiteboard: aes+
Crash Signature: [@ InvalidArrayIndex_CRASH | nsTArray_Impl<T>::operator[] | mozilla::a11y::AccessibleWrap::GetRemoteIAccessibleFor] → [@ InvalidArrayIndex_CRASH | nsTArray_Impl<T>::operator[] | mozilla::a11y::AccessibleWrap::GetRemoteIAccessibleFor] [@ InvalidArrayIndex_CRASH | mozilla::a11y::AccessibleWrap::GetRemoteIAccessibleFor ]
(In reply to alexander :surkov from comment #4)
> Marco, do you have steps to reproduce?

Unfortunately not. I only got this once directly before I filed it. Haven't seen it happening since.
the signature is regressing in volume during the 57.0b cycle.

one striking correlation is (61.88% in signature vs 00.34% overall) Module "WRusr.dll" = true
Whiteboard: [AV:Webroot SecureAnywhere]
Currently #17 in the browser crash list. Wondering if there's something we can do here, this looks like some sort of race - 

https://crash-stats.mozilla.com/report/index/71c5d4c2-ea2a-4e73-aafa-1dc120171023
Whiteboard: [AV:Webroot SecureAnywhere] → [AV:Webroot SecureAnywhere][aes+]
Frame 36 is interesting, looks like COM reentry caused a mutation of the array in question.
(In reply to Aaron Klotz [:aklotz] (a11y work receiving priority right now, please send interceptor reviews to dmajor or handyman) from comment #2)
> (In reply to Trevor Saunders (:tbsaunde) from comment #1)
> > So, The only way I can see an out of bounds array access there is for the
> > array to be mutated while we iterate over it, which really shouldn't be
> > happening.  The only way I can see for it to happen is that while
> > GetProxyAccessibleInSubtree() is makeing remote com calls to the content
> > process the parent process handles IPC messages.  Aron does that seem like
> > something that can be happening?
> 
> I wouldn't expect that to happen given that the parent main thread is
> blocked on a remote COM call. ie, it shouldn't be able to dispatch any
> incoming IPDL while waiting for a reply from COM.
> 
> OTOH, since the main thread is in the single-threaded apartment, COM is
> continuously pumping messages while waiting for that reply. If we handle a
> message in such a way that we inadvertently trigger a nested event loop that
> ends up dispatching IPDL, then I guess it could happen.

Any way to guard this?
Priority: P3 → P2
Jamie said that he was going to take a look at this.
Assignee: nobody → jteh
Status: NEW → ASSIGNED
This crash has a high correlation with Webroot anti-virus so it may not be 100% our fault:

(60.87% in signature vs 00.39% overall) Module "WRusr.dll" = true [74.29% vs 02.36% if platform_version = 10.0.15063]
I'm going to take this since we need to land something ASAP for Beta. A band-aid fix is simple enough...
Assignee: jteh → aklotz
This patch isn't perfect, but it's a low-risk fix that is acceptable to beta. Since GetProxiedAccessibleInSubtree makes a remote COM call, it is possible for something to reenter and mutate the remoteDocs array while that call is pending.

By moving the length check into the loop's condition, we ensure that we always verify that the length is still valid.

I think a better long-term fix involves a bunch of stuff with iterators which I don't want to introduce for such a tight timeline.
Attachment #8923571 - Flags: review?(yzenevich)
Comment on attachment 8923571 [details] [diff] [review]
Always re-check the array length

Actually, Jamie, would you mind looking at this?
Attachment #8923571 - Flags: review?(yzenevich) → review?(jteh)
Comment on attachment 8923571 [details] [diff] [review]
Always re-check the array length

Nice! Please add a brief comment explaining that we're doing this to deal with the array length changing due to COM re-entry. I'm concerned that someone down the track might think this was a mistake and try to optimise it out. :) Also, don't forget to change the r= in the commit message.
Attachment #8923571 - Flags: review?(jteh) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b310709a1998f913245e2aafe17fe3c95b5906e
Bug 1336971: Ensure that we always re-examine the length of the top-level remote doc array to pick up any changes due to mutation; r=Jamie
https://hg.mozilla.org/mozilla-central/rev/1b310709a199
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Please request Beta approval on this when you get a chance.
Flags: needinfo?(aklotz)
Comment on attachment 8923571 [details] [diff] [review]
Always re-check the array length

Approval Request Comment
[Feature/Bug causing the regression]: Windows a11y on e10s
[User impact if declined]: crashes
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Really simple fix
[String changes made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8923571 - Flags: approval-mozilla-beta?
Comment on attachment 8923571 [details] [diff] [review]
Always re-check the array length

Seems low risk, crash fix with decent volume on beta, beta57+
Attachment #8923571 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/05e050d6c868
(In reply to Aaron Klotz [:aklotz] (a11y work receiving priority right now, please send interceptor reviews to dmajor or handyman) from comment #19)
> [Is this code covered by automated tests?]: Yes
> [Has the fix been verified in Nightly?]: Yes
> [Needs manual test from QE? If yes, steps to reproduce]: No

Setting qe-verify- based on Aaron's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: