Closed Bug 1368270 Opened 7 years ago Closed 4 years ago

Crash in mozilla::a11y::IDSet::GetID (MSAA id exhaustion)

Categories

(Core :: Disability Access APIs, defect, P2)

52 Branch
All
Windows
defect

Tracking

()

VERIFIED FIXED
mozilla80
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox-esr78 81+ verified
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 + wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox78 --- wontfix
firefox79 --- wontfix
firefox80 --- verified

People

(Reporter: philipp, Assigned: Jamie)

References

(Regression)

Details

(Keywords: crash, regression, Whiteboard: a11y:crash-win )

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-61312828-4aef-4b83-8a9f-5a4fa0170526.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::a11y::IDSet::GetID() 	obj-firefox/dist/include/mozilla/a11y/IDSet.h:76
1 	xul.dll 	mozilla::a11y::MsaaIdGenerator::GetID() 	accessible/windows/msaa/MsaaIdGenerator.cpp:93
2 	xul.dll 	mozilla::a11y::AccessibleWrap::GetChildIDFor(mozilla::a11y::Accessible*) 	accessible/windows/msaa/AccessibleWrap.cpp:1331
3 	xul.dll 	mozilla::a11y::AccessibleWrap::FireWinEvent(mozilla::a11y::Accessible*, unsigned int) 	accessible/windows/msaa/AccessibleWrap.cpp:1230
4 	xul.dll 	mozilla::a11y::AccessibleWrap::HandleAccEvent(mozilla::a11y::AccEvent*) 	accessible/windows/msaa/AccessibleWrap.cpp:1280
5 	xul.dll 	nsEventShell::FireEvent(mozilla::a11y::AccEvent*) 	accessible/base/nsEventShell.cpp:45
6 	xul.dll 	mozilla::a11y::NotificationController::ProcessMutationEvents() 	accessible/base/NotificationController.cpp:552
7 	xul.dll 	mozilla::a11y::NotificationController::WillRefresh(mozilla::TimeStamp) 	accessible/base/NotificationController.cpp:813
8 	xul.dll 	nsRefreshDriver::Tick(__int64, mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp:1798
9 	xul.dll 	mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver*, __int64, mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp:326
10 	xul.dll 	mozilla::RefreshDriverTimer::TickRefreshDrivers(__int64, mozilla::TimeStamp, nsTArray<RefPtr<nsRefreshDriver> >&) 	layout/base/nsRefreshDriver.cpp:295
11 	xul.dll 	mozilla::RefreshDriverTimer::Tick(__int64, mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp:316
12 	xul.dll 	mozilla::VsyncRefreshDriverTimer::RunRefreshDrivers(mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp:663
13 	xul.dll 	mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp:583
14 	xul.dll 	mozilla::detail::RunnableMethodImpl<void ( mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::*)(mozilla::TimeStamp), 1, 0, mozilla::TimeStamp>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:810
15 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1216
16 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:361
17 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:124
18 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:225
19 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:205
20 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
21 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:262
22 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp:283
23 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:4488
24 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp:4621
25 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:4712
26 	firefox.exe 	do_main 	browser/app/nsBrowserApp.cpp:282
27 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:115
28 	firefox.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253
29 	kernel32.dll 	BaseThreadInitThunk 	
30 	ntdll.dll 	__RtlUserThreadStart 	
31 	ntdll.dll 	_RtlUserThreadStart

this crash signature on windows is showing up since firefox 52 and in subsequent versions. nearly all the reports were crashing with "MOZ_CRASH(used up all the available ids)" that got added in bug 606080.
Wow, that's ... remarkable... Sounds like a lot of tabs. Unfortunately there doesn't seem to be any of the known screen readers involved. At least not in the crash from comment #0.

Aaron, ever seen this?
Flags: needinfo?(aklotz)
Wow... now that was something I was not expecting to see.

There is a different assertion that I was expecting to see but that one only shows up when dom.ipc.processCount >= 128. This one I was definitely not expecting.

It's implying one of two things. Either:

1) There are so many accessibles that they have exhausted all 2^24 unique ids; or
2) Accessibles are not always releasing their ids when they are destroyed.
Flags: needinfo?(aklotz)
Given that I've seen this crash linked to MemShrink bugs, I am convinced that this is just a symptom of a bigger problem: there is a leak elsewhere such that nodes (and their associated accessibles) are not being cleaned up and we're exhausting our unique ID space.

I don't think there is much we can do here other than fix the bug(s) that are causing the node leakage.
(In reply to Aaron Klotz [:aklotz] (a11y work receiving priority right now, please send interceptor reviews to dmajor or handyman) from comment #3)
> Given that I've seen this crash linked to MemShrink bugs, I am convinced
> that this is just a symptom of a bigger problem: there is a leak elsewhere
> such that nodes (and their associated accessibles) are not being cleaned up
> and we're exhausting our unique ID space.
> 
> I don't think there is much we can do here other than fix the bug(s) that
> are causing the node leakage.

I bet if we had more data, then we might had more ideas. For example, do we know whether this happens both in e10s and non e10s builds? Whether there are correlations for user's surfing habbits, e.g. number of open tabs, life time of the tabs, whether the problem is visible on certain websites only.

Also it might be not Gecko's problem if I understand it right. If AT fails to release an accessible object, then we should face this issue sooner or later. If the latter issue is valid, then what steps can taken on this way? Can we force the IDs pool clearance at some point?

So, how can we approach to this?
Setting it to P1.

Aaron, do you have actionable ideas on this bug?
Flags: needinfo?(aklotz)
Priority: -- → P1
Not at the moment. I will mark it for triage and we'll discuss it tomorrow.
Flags: needinfo?(aklotz)
Whiteboard: aes+
Summary: Crash in mozilla::a11y::IDSet::GetID → Crash in mozilla::a11y::IDSet::GetID (MSAA id exhaustion)
Whiteboard: aes+
This is a P1 bug without an assignee. 

P1 are bugs which are being worked on for the current release cycle/iteration/sprint. 

If the bug is not assigned by Monday, 28 August, the bug's priority will be reset to '--'.
Keywords: stale-bug
Priority: P1 → P2
Whiteboard: a11y:crash-win
This crashed a tab for one of my web app's testers. My web app is a continuous stream of live data that is presented by adding server formatted html to the page using element.innerHTML = new server data. The server data has around 20 html elements, and it is updated 5-10 times a second.

The user had only a couple of other non-demanding tabs open. He wasn't using FF to surf other than for testing the web app. He had FF open for awhile, maybe a few days. I will try to make a test case that reliably triggers this crash.

https://crash-stats.mozilla.com/report/index/f0cbea81-9665-4757-82da-144271180615#tab-details
(In reply to justinpulliam from comment #9)
> This crashed a tab for one of my web app's testers. My web app is a
> continuous stream of live data that is presented by adding server formatted
> html to the page using element.innerHTML = new server data. The server data
> has around 20 html elements, and it is updated 5-10 times a second.

Thanks for the details. Just to double check, do you mean the data is completely *replaced* every update with 20 elements or that 20 elements are *added* every update (so 20, then 40, then 60, etc. total)? I'm guessing the former, but just wanted to clarify given your use of the term "stream of data".

After bug 1434822, we now force disconnect remote accessibility clients when a content accessible is shut down. That should cause the id to be released. So, if the data is being replaced (not added to), we must be leaking an accessible somehow and not shutting it down.
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Closing because no crashes reported for 12 weeks.
There are 1000+ crashes for this signature in the past week.
Someone should investigate why that bot couldn't find them...
How many other bugs like this were erroneously closed?

bp-dde060dd-cc7b-4cc8-8fa2-5d8780190106
MOZ_CRASH Reason: MOZ_CRASH(used up all the available ids)
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---

Jamie, is this on the radar for getting fixed soon?

Flags: needinfo?(jteh)

No useful comments, will check URLs for some clues.

I don't think we have enough data to take any action here. As per earlier comments, this would happen if there are more than 2 ^ 24 accessible objects in a single process. Either:

  1. Some users really do have this many accessibles. This would mean that there are at least this many visible DOM nodes (attached to document and not display: none). Do we have any data on this?
  2. There is a leak somewhere preventing DOM nodes from being cleaned up and thus their associated accessibles.
  3. There is a leak somewhere preventing dead accessibles from being completely destroyed.
    Without any clue as to how to reproduce this, I don't know how we can figure out which of these it is.
Flags: needinfo?(jteh)

(In reply to James Teh [:Jamie] from comment #17)

I don't think we have enough data to take any action here. As per earlier comments, this would happen if there are more than 2 ^ 24 accessible objects in a single process. Either:

  1. Some users really do have this many accessibles. This would mean that there are at least this many visible DOM nodes (attached to document and not display: none). Do we have any data on this?
  2. There is a leak somewhere preventing DOM nodes from being cleaned up and thus their associated accessibles.
  3. There is a leak somewhere preventing dead accessibles from being completely destroyed.
    Without any clue as to how to reproduce this, I don't know how we can figure out which of these it is.

I wonder if mccr8 or smaug or hsivonen have numbers/thoughts on the above?

Flags: needinfo?(hsivonen)
Flags: needinfo?(continuation)
Flags: needinfo?(bugs)

It is certainly plausible that a page could leak a few million nodes. Comment 9 does sound like something that could do that. We've seen infinite scrolling news feed kind of things do that in the past.

Flags: needinfo?(continuation)

Maybe this is a WONTFIX?

(In reply to Andrew McCreight [:mccr8] from comment #19)

It is certainly plausible that a page could leak a few million nodes. Comment 9 does sound like something that could do that. We've seen infinite scrolling news feed kind of things do that in the past.

When you refer to leaked nodes, are these attached to the document and visible (not display: none)? Accessibles should get destroyed if the node is not attached and/or display: none.

Flags: needinfo?(continuation)

Well, mostly when I've seen it they aren't in the document, but I could imagine a page that just gets longer and longer as more things are loaded in.

Flags: needinfo?(continuation)

(In reply to justinpulliam from comment #9)

This crashed a tab for one of my web app's testers. My web app is a
continuous stream of live data that is presented by adding server formatted
html to the page using element.innerHTML = new server data. The server data
has around 20 html elements, and it is updated 5-10 times a second.

The user had only a couple of other non-demanding tabs open. He wasn't using
FF to surf other than for testing the web app. He had FF open for awhile,
maybe a few days. I will try to make a test case that reliably triggers this
crash.

Any update on a test case or whether your app can be accessed via so public URL? Can you clarify if you used innerHTML to add nodes or to replace nodes?

Flags: needinfo?(hsivonen) → needinfo?(justinpulliam)

2 ^ 24 nodes takes over 2GB on 64bit system + then all the other memory usage. But sounds still possible.

Does a11y keep Accessible objects alive also for the DOM trees in background tabs?

Flags: needinfo?(bugs)

(In reply to Olli Pettay [:smaug] from comment #24)

Does a11y keep Accessible objects alive also for the DOM trees in background tabs?

Yes, as some clients (notably screen readers) need them in order to keep their own models up to date.

This is a pretty high crash volume in beta 66. Tracking for 66 but from jimm's comments in triage it may be a difficult problem relating to e10s.

the user at https://support.mozilla.org/en-US/questions/1247107 had a reproducible case (unfortunately it looked like an internal app).

It'd be good to know what that app does when it adds new data. Does it just keep adding more elements to the document or does it remove old elements? If a document contains enough nodes to cause this crash, I would think it would consume a huge amount of memory, even with accessibility disabled.

Hi everyone,

I received an email from philipp that you had interest in what caused the daily crash in my Firefox.

I really do not know which "browser window/tab" caused the crashes, if any. We have as many as 10 tabs/windows running 24/7. I had given a photo of what they look like here: https://support.mozilla.org/en-US/questions/1247107

All I know is that the solution was to activate the "prevent accessibility services from accessing your browser" setting. I do not really understand what this setting does. There have been no more crashes after activating this setting. So, it's kind of like a "magic button" to me (press the magic button and the problem is solved). But all my tabs continue to function as usual and are not affected by this setting.

I would be happy to provide more information, but I think I may need some guidance on what information to provide.

Thanks for helping to solve my daily crash.

Thanks,
Jonah.
+65-98350243

James, any other ideas on this one for 66?

Flags: needinfo?(jteh)
Whiteboard: a11y:crash-win → a11y:crash-win [MemShrink]
Whiteboard: a11y:crash-win [MemShrink] → a11y:crash-win

(In reply to jonahyongntu from comment #29)

I received an email from philipp that you had interest in what caused the daily crash in my Firefox.

Thanks so much for following up and for your willingness to help.

All I know is that the solution was to activate the "prevent accessibility services from accessing your browser" setting. I do not really understand what this setting does.

Normally, accessibility services are used by people with disabilities; e.g. a screen reader used by blind users. I assume you don't have any users with disabilities requiring such tools, as preventing accessibility services would cause Firefox to be inaccessible to them. Unfortunately, some other apps (security software, anti-virus software, malware, etc.) use Firefox's accessibility framework as well. My guess is that some app on your system was doing this. While not entirely related, I'd be curious to know what app is triggering this on your system. You can determine this as follows:

  1. Uncheck "Prevent accessibility services" in Firefox Options and restart Firefox.
  2. Go to this address using the address bar: about:support
  3. Look for "Accessibility Instantiator" and include the information specified there in your reply.

I would be happy to provide more information, but I think I may need some guidance on what information to provide.

With regard to the tabs you have open, do any of them continually add new information to the document without clearing previous information? That is, the amount of information in the document continues to increase the longer it is open (and the further you can scroll back the longer it is open)? Or does the information on the page simply update (rather than adding more content to the bottom)?

Thanks again for your help.

Flags: needinfo?(jteh) → needinfo?(jonahyong)

Jamie we need an assignee on tracked bugs.

Flags: needinfo?(jteh)

Since this needs an owner, I'll "own" this, but it's important to note that right now, it isn't actionable without more data/a reproduceable test case.

Assignee: nobody → jteh
Flags: needinfo?(jteh)

(In reply to James Teh [:Jamie] from comment #33)

Since this needs an owner, I'll "own" this, but it's important to note that right now, it isn't actionable without more data/a reproduceable test case.

should it be a tracked bug then?

It's being tracked because of the high crash volume as per comment 26. I guess the hope is that we'll somehow acquire data that will allow us to implement a fix. I very much doubt this is going to happen at this point, though.

Hi James,

Sorry. The system needs to be running 24/7, so I do not want to "undo the solution" by unchecking "prevent accessibility services".

But I can answer the questions regarding how the tabs work.

None of the tabs get new information added to them continually:
https://prod-cdn.sumo.mozilla.net/uploads/images/2019-01-20-18-07-51-3e20d4.png

The "charts using divs and pngs" receive data for about 5 seconds, and then just stay there. There is a javascript to "reload the page" once every 5 minutes.

The webmap, gif animations and remaining tabs are pretty static too. They just load once and then sit there forever.

Flags: needinfo?(jonahyong)
No longer blocks: 606080
Has Regression Range: --- → yes
Regressed by: 606080

Wontfix for the releases in flight as this is currently stalled.

Attached file crash.html

This is a test case that crashes my browser (both 70.0.1 and compiled from source) at that assertion in less than an hour.

The relevant section of my about:support is

Accessibility
Activated 	true
Prevent Accessibility 	0
Accessible Handler Used 	false
Accessibility Instantiator 	UNKNOWN|
Library Versions

When an Accessible is shut down, there might still be external references to it, so we keep the instance alive.
We don't want to reuse the id until all references are released, so we only release the id in the destructor (which runs when all references are released).
Previously, we were clearing the id variable in Shutdown, which meant we couldn't release the id in the destructor!
This meant we always leaked ids, never reusing them.
Now, we don't clear the id in Shutdown.

Keywords: stale-bug
Pushed by mzehe@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/38eb7e998859
When shutting down Windows AccessibleWraps, don't clear the id. r=MarcoZ
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80

Too late to get it into 79.0 at this point, but I think we may want to consider getting it onto ESR78 eventually given the high crash volume and maybe even a 79.0.1 dot release.

I'd really like this to bake on nightly and perhaps even beta for a few weeks at least. While I believe the patch itself to be safe, it causes ids to be reused. That's precisely the original intention... but the previous (non-reuse) behaviour existed for years and there's a slight chance we might see some fallout. For example, if an object dies, it's slightly possible its id might get reused before a client has even had a chance to process the fact that the old object died, particularly in an asynchronous, coalescing world.

Flags: qe-verify+

I've reproduced the crash signatures with Fx 72.0a1 on Windows 10, helped by the information from comment 38 (after running the browser for about 45m).
I did not longer reproduced this particular crash with the latest builds RC 80.0 and 81.0a1 (Windows 10), but after around 1h20m the browser crashed with the OOM signature, which due to the nature of the test case I think should be expected.
James, is this expected indeed, should I consider this issue verified ( note that I've checked with several profiles without encountering the mozilla::a11y::IDSet::GetI crash at all) or should it be something else to look for?

Flags: needinfo?(jteh)

(In reply to Anca Soncutean [:Anca], Desktop Release QA from comment #45)

I've reproduced the crash signatures with Fx 72.0a1 on Windows 10, helped by the information from comment 38 (after running the browser for about 45m).

Were you running an accessibility client? What did the Accessibility section of about:support show for you?

I did not longer reproduced this particular crash with the latest builds RC 80.0 and 81.0a1 (Windows 10), but after around 1h20m the browser crashed with the OOM signature, which due to the nature of the test case I think should be expected.

The test case gets rid of old nodes when it adds new ones, so seeing OOM here concerns me. It'd be interesting to know whether you see OOM (though it might take longer) with the accessibility.force_disabled pref set to 1.

should I consider this issue verified ( note that I've checked with several profiles without encountering the mozilla::a11y::IDSet::GetI crash at all)

I think we can consider this verified.

or should it be something else to look for?

If you can't repro the OOM with accessibility.force_disabled set to 1, even with a longer time period, we should file anew bug in disability access APIs. Thanks!

Flags: needinfo?(jteh)

Hi Jamie, did you want to nominate this for ESR78 approval now or wait for another cycle? Note that we'll be migrating ESR68 users to ESR78 with the upcoming 78.3esr release.

Flags: needinfo?(jteh)

Comment on attachment 9165607 [details]
Bug 1368270: When shutting down Windows AccessibleWraps, don't clear the id.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Crashes.
  • User impact if declined: Crashes with accessibility enabled on pages which create a lot of nodes.
  • Fix Landed on Version: 80
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Straightforward change which results in the behaviour that was originally intended. Baked on beta, nightly and release with no reported issues.
  • String or UUID changes made by this patch: None
Flags: needinfo?(jteh)
Attachment #9165607 - Flags: approval-mozilla-esr78?
QA Whiteboard: [qa-triaged]

Comment on attachment 9165607 [details]
Bug 1368270: When shutting down Windows AccessibleWraps, don't clear the id.

crash fix, approved for 78.3

Attachment #9165607 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+

As already mentioned in comment 45, I’ve reproduced the crash signatures with Fx 72.0a1 on Windows 10. I could no longer reproduced this crash (or the OOM signature I’ve ran into when attempting to verified this issue the first time). Issue verified fixed with Nightly 82.0a1 (20200909213959), RC 80.0.1 (20200831163820) and ESR 78.3.0 (treeherder build - 20200909191438) on Windows 10.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: