Open Bug 1419108 Opened 2 years ago Updated 5 days ago

Crash in shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn

Categories

(Core :: Disability Access APIs, defect, P1, critical)

57 Branch
All
Windows
defect

Tracking

()

Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- affected
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 - wontfix
firefox68 + wontfix
firefox69 --- fix-optional
firefox70 --- fix-optional

People

(Reporter: davidb, Assigned: Jamie, NeedInfo)

References

(Depends on 1 open bug)

Details

(4 keywords, Whiteboard: a11y:crash-win)

Crash Data

This is #13 in 57 release.

This bug was filed from the Socorro interface and is
report bp-25f8e0bb-b99a-4242-a3a2-0fe5a0171120.
=============================================================

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForMultipleObjects 
1 kernelbase.dll RtlAnsiStringToUnicodeString 
2 kernel32.dll WaitForMultipleObjectsExImplementation 
3 user32.dll RealMsgWaitForMultipleObjectsEx 
4 user32.dll MsgWaitForMultipleObjectsEx 
5 ole32.dll CCliModalLoop::BlockFn 
6 ole32.dll ModalLoop 
7 ole32.dll CChannelHandle::RestoreToken 
8 ole32.dll `AsyncStubInvoke'::`1'::filt$0 
9 ole32.dll ChannelProcessUninitialize 

=============================================================
NI some usual suspects for awareness.
Flags: needinfo?(jteh)
Flags: needinfo?(jmathies)
Flags: needinfo?(aklotz)
This could be related to a lot of the freezes users are reporting.

100.0% in signature vs 16.53% overall) moz_crash_reason = MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.)
Yeah, based on the samples that I've seen with this sig, it's an "unconventional" a11y client that is doing a bunch of stuff, which ends up being so slow that everything just grinds to a halt, hence the forced shutdown.
Flags: needinfo?(aklotz)
Aggregating that crash signature on "Accessibility client" shows some useful info.
There are some crashes in 58 as well but none yet in 59.
Crash Signature: [@ shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] → [@ shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | CCliModalLoop::BlockFn] [@ shutdownhang | BaseGetNamedObjec…
Depends on: 1405290
Depends on: 1437575
Whiteboard: a11y:crash-win
the [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn ] signature is jumping up on windows 10 since 2018-09-11 across release channels. many user comments indicate that users aren't perceiving this as a crash on shutdown but a hang/freeze during browsing.

the signature accounts for 2.5% of browser crashes on release in the past few days.
could someone look into that recent crash spike?
Flags: needinfo?(dbolter)
OS: Windows 7 → All
Hardware: Unspecified → All
OS: All → Windows
Taking NI from davidb, since I manage a11y now.

I actually spent a while looking into this last week, but didn't turn up anything useful. :(

It doesn't seem that this is related to a specific third party accessibility client.

One important point is that it seems many of these stacks have something in msctf.dll (which I think is associated with Microsoft Text Services Framework) calling IAccessible::accRole on an oleacc proxy for an accessible in a content process. The oleacc proxy then queries for some interface (maybe IAccIdentity? I need to check that), at which point things freeze. Note that this isn't true for all stacks; some have msctf calling AccessibleObjectFromEvent, which in turns calls accChild, which then freezes.
Regardless, these stacks suggest the content process stopped responding when (or before?) an a11y query was made, but what I don't understand is why. And without a stack from the content process (or a way to reproduce this), that's going to be nearly impossible to figure out. Theoretically, content might not even be freezing in the a11y call; it might have frozen before that point.

I also can't see anything that landed in the Gecko accessibility module around the time this spike started that seems like it might have any impact on this kind of thing.

I'll keep digging.
Flags: needinfo?(dbolter)
>I also can't see anything that landed in the Gecko accessibility module around the time this spike started that seems >like it might have any impact on this kind of thing.
the spike happened across different release channels at the same time, so it's unlikely that some code-change from our side caused it. timing wise the spike took place on the same day as the monthly windows patchday, so there might be some connection and a change in the platform triggered this.
asian locale builds (zh-cn, ja, ko) seem to be slightly overrepresented in these crash reports for what it's worth...
Marking this wontfix for 62/63, but we could still potentially take a fix for 64. 
The crash volume is very high on 62 on release, with over 5000 crashes in the last week.
Marking fix-optional for 64. We could still take a patch for 65, and if it's verified and doesn't seem risky, could still take fixes for 64 as well.
Duplicate of this bug: 1508251
Crash Signature: BaseGetNamedObjectDirectory | CCliModalLoop::BlockFn] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | NtUserCallNoParam | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] → BaseGetNamedObjectDirectory | CCliModalLoop::BlockFn] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | NtUserCallNoParam | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | NtUserMsg…
Adding 66 as affected. In 64 release shutdownhang | NtUserMsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn is #10 overall.
Looks like something in the 20181224215145 build may have decreased the crash volume in Nightly.

Not going to make 65 at this point, but leaving it set to fix-optional as a possible dot release ride-along should a low-risk patch be available.

Tracking for 66 to keep an eye on this in beta.

Changing the priority to p2 as the bug is tracked by a release manager for the current nightly.
See How Do You Triage for more information

Priority: P3 → P2

Marking 67 as affected. #6 overall browser crash in 65 release.

Adding another signature, Windows 10.

Crash Signature: [@ shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | CCliModalLoop::BlockFn] [@ shutdownhang | BaseGetNamedObjec… → [@ shutdownhang | RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | NtUserMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn] [@ shutdownhang | MsgWaitForMultipleObject…

NI Aaron in case he has any quick hunches.

Flags: needinfo?(aklotz)

It would be really helpful if we also had dumps from the content process side of things.

By virtue of the fact that we already know which child process we're waiting on (since we're resolving a child id that has a content process id encoded into it), it would be nice if we had some kind of RAII mechanism to say, "Yo, hang reporter, if you get any hangs while I'm on the stack, take a paired minidump with this process, okay?"

Sorry I don't have a better answer atm, but we need more data :-)

Flags: needinfo?(aklotz)

Aaron, do you know who might work on getting us that sort of information? is that something for work on the crash reporter itself?

Flags: needinfo?(aklotz)

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #24)

Aaron, do you know who might work on getting us that sort of information? is that something for work on the crash reporter itself?

I think it probably involves the hang reporter more than anything. IIRC Gabriele has recently worked on the hang monitor.

Gabriele, is my proposal in Comment 23 reasonable? Do you or somebody you know have cycles to implement such an API?

Flags: needinfo?(aklotz) → needinfo?(gsvelto)

Yes, we could modify the shutdown terminator with something like that. The only issue that would need to be addressed is that we're currently just calling MOZ_CRASH() to generate the crash report; we would need to issue the action to take a paired minidump manually. It's not a huge deal but I'm swamped with urgent stuff and won't have free cycles before the end of next week.

Flags: needinfo?(gsvelto)
Depends on: 1527668

(In reply to Gabriele Svelto [:gsvelto] from comment #26)

Yes, we could modify the shutdown terminator with something like that. The only issue that would need to be addressed is that we're currently just calling MOZ_CRASH() to generate the crash report; we would need to issue the action to take a paired minidump manually. It's not a huge deal but I'm swamped with urgent stuff and won't have free cycles before the end of next week.

I filed bug 1527668 for this part.

Discussed with Gabriele in email; other projects have priority for this week.

Jamie we need an assignee on tracked bugs.

Since this needs an owner, I'll "own" this, but it's important to note that right now, it isn't actionable without more data/a reproduceable test case.

Assignee: nobody → jteh

We just had 50+ crashes over the week end on Nightly 67, Jamie is there something that landed recently that could explain this sudden spike on Nightly?

This is only a hunch, but next gen storage just got re-enabled on Nightly in bug 1517090. It was disabled originally due to a11y hangs; see bug 1516136. The hang was believed to be fixed (and I couldn't reproduce it any more with brief testing), but perhaps there's still an edge case?

Flags: needinfo?(jteh)

This is now the #2 browser crash on 67 nightly (shutdownhang | NtUserMsgWaitForMultipleObjectsEx | CCliModalLoop::BlockFn).

Keywords: topcrash

The deadlock described in comment 32 was fixed in bug 1534208. However, there seems to be another one: bug 1535221.

It'd be interesting to know whether the spike is caused by builds without the fix for bug 1534208 or whether the spike is still occurring even with that fix.

(In reply to James Teh [:Jamie] from comment #34)

It'd be interesting to know whether the spike is caused by builds without the fix for bug 1534208 or whether the spike is still occurring even with that fix.

the crash volume went down considerably in builds 20190311215435 & later that have the fix (from 20-50 crashes per build to 1-10 now)

(In reply to James Teh [:Jamie] from comment #32)

This is only a hunch, but next gen storage just got re-enabled on Nightly in bug 1517090. It was disabled originally due to a11y hangs; see bug 1516136. The hang was believed to be fixed (and I couldn't reproduce it any more with brief testing), but perhaps there's still an edge case?

The volume is very high on beta with about 100 crashes per day, so setting as P1. LSNG is delayed to 68 and will be deactivated in the next beta (67.0b9), which should tell you if the theory is correct.

Crashes went down by more than an order of magnitude on beta after we deactivated LSNG (310 crashes in beta 8 vs 22 in beta 10) so no need to track it anymore for 67 and marking as fix-optional for the release, but tracking for 68.

Hi Jamie, looks like this is related to LSNG (which is probably being enabled again for 68). Any updates? (tracking this in release triage for 68). Thanks!

Flags: needinfo?(jteh)

LSNG issue was fixed in bug 1535221.

I think we're now back to the original situation here; i.e. not LSNG, thus much lower volume, but without repro or data and thus little hope of a diagnosis or fix. :( See comment 10, comment 23, comment 26, comment 27.

Flags: needinfo?(jteh)

I might be able to reproduce.

Martijn: Were you able to reproduce this issue? Thanks.

Flags: needinfo?(martijn.martijn)

wontfix for 68 based on comment 40.

Also marking this stalled based on comment 40. Happy to take a patch though or otherwise help out.

You need to log in before you can comment on or make changes to this bug.