Closed Bug 1786676 Opened 2 years ago Closed 1 year ago

[CTW] CTW causes occasional infinite loop in NVDA vbuf code

Categories

(Core :: Disability Access APIs, defect)

Desktop
Windows
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: Jamie, Unassigned)

References

Details

(Whiteboard: [ctw-m5])

There's no reliable way to reproduce this that I've found. It happens rarely, but when it does, you have to kill the browser. While I'm not ruling out a bug in NVDA, I haven't heard any reports of this without CTW or in Chromium browsers (which use the same NVDA vbuf code), so it seems to be related to CTW somehow.

Sometimes, Firefox will freeze without warning and report "not responding". If I attach with a debugger, I see a stack like this:

00 000000fa`927fe350 00007ffc`c3d59c52     nvdaHelperRemote!VBufStorage_buffer_t::insertNode(class VBufStorage_controlFieldNode_t * parent = 0x000002ac`9853a160, class VBufStorage_fieldNode_t * previous = 0x00000000`00000000, class VBufStorage_fieldNode_t * node = 0x000002ac`98a88210)+0xc53 [C:\projects\nvda\build\x86_64\vbufBase\storage.cpp @ 494] 
01 000000fa`927fe3f0 00007ffc`c3d69afe     nvdaHelperRemote!VBufStorage_buffer_t::replaceSubtrees(class std::map<VBufStorage_fieldNode_t *,VBufStorage_buffer_t *,std::less<VBufStorage_fieldNode_t *>,std::allocator<std::pair<VBufStorage_fieldNode_t * const,VBufStorage_buffer_t *> > > * m = 0x000000fa`927fe570 { size=0x2 })+0xa82 [C:\projects\nvda\build\x86_64\vbufBase\storage.cpp @ 690] 
02 000000fa`927fe540 00007ffc`c3d6928c     nvdaHelperRemote!VBufBackend_t::update(void)+0x24e [C:\projects\nvda\build\x86_64\vbufBase\backend.cpp @ 214] 
03 000000fa`927fe600 00007ffd`1489121c     nvdaHelperRemote!VBufBackend_t::renderThread_timerProc(struct HWND__ * hwnd = 0x00000000`00000000, unsigned int msg = 0, unsigned int64 timerID = 0x3f4b, unsigned long time = 0x80000022)+0xbc [C:\projects\nvda\build\x86_64\vbufBase\backend.cpp @ 120] 
04 000000fa`927fe630 00007ffd`14890cf8     user32!UserCallWinProc+0x2cc
05 000000fa`927fe790 00007ffc`75309b7d     user32!DispatchMessageWorker+0xf8
06 000000fa`927fe810 00007ffc`7519f2ed     xul!nsAppShell::ProcessNextNativeEvent(bool mayWait = <Value unavailable error>)+0x1ed [/builds/worker/checkouts/gecko/widget/windows/nsAppShell.cpp @ 738] 
...

Here is the code at the top of the stack:

		for(VBufStorage_fieldNode_t* ancestor=node->parent;ancestor!=NULL;ancestor=ancestor->parent) {
			ancestor->length+=node->length;
...

This suggests that somehow, NVDA's vbuf got into a state where there was a loop in the tree such that walking the ancestors is an infinite loop. This happens during an update to the buffer, not during the initial build. This could mean that the Gecko a11y tre contained a loop at some point. Alternatively, it could mean that Gecko a11y isn't firing some event that it should, so NVDA isn't getting the correct info it needs to update the buffer correctly.

I had hoped that bug 1782146 would fix this, but I saw it happen after that fix. :(

FWIW, I haven't seen such browser freezes with JAWS and CTW enabled yet.

Thanks; that's good to know. Even if this is caused by a Firefox bug, I think the NVDA code could probably be more resilient; it shouldn't ever get into a state where the tree has loops, even if the browser events were wrong or something. Unfortunately, I haven't managed to pin down where things could be tightened up yet.

Depends on: 1663796

I haven't been able to reproduce this infinite loop since these two fixes landed. Closing.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

Drat. I just saw this again! This time, I saw nvdaHelperRemote!VBufBackend_t::forceUpdate in the stack, which means a buffer update was being forced in response to a focus or alert event. That means it's possibly a slightly different issue.

I'm moving this out of m3, since this is clearly going to take more time to investigate and it does seem to be happening more rarely now.

Whiteboard: [ctw-m3] → [ctw-m4]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I've seen this a few times since. The case I saw just now didn't involve forceUpdate, so we still have a problem that isn't related to focus or alert events here. Ug.

I've been mulling over this for a while now and still haven't gotten anywhere solid. Some thoughts before I head out for a couple of weeks.

Assuming there isn't a loop in the Gecko a11y tree (and I imagine Gecko would crash or freeze elsewhere if there were), I've been trying to work out how a loop could be created in NVDA's buffer. The only scenario I can come up with is that while updating a subtree (n1), NVDA encountered a node it thought it could reuse (n2), but that reused node actually had an ancestor of the subtree root as one of its descendants (e.g. n2 had n1 as a child). In turn, I think this could only happen if an id was reused for a different node and somehow NVDA didn't know about or hadn't processed the removal of the old node. Two possibilities there:

  1. Gecko didn't fire some event that it should.
  2. Gecko did fire the right event for the parent and NVDA did invalidate the subtree. However, before NVDA could update the subtree, the id got reused.
    Gecko has a 1 second timer before ids are released. That should be sufficient, since NVDA buffer updates happen every 100 ms. However, maybe the buffer update gets delayed for some reason - e.g. heavy processing in Firefox's main thread, perhaps even from other buffers - and the Gecko id release timer runs before the buffer update.

We could try increasing the id release timer, but I worry that's just kicking the can down the road.

Another thought is that maybe NVDA could just remove subtrees from its buffer when it gets a hide event. Hide events are likely more reliable than text removed/reorder events. I'm not sure if that's feasible, though, and it'd be hacking around a (mysterious) Gecko bug.

I've also been wondering how I can get more info about this, given how hard it is to reproduce. Is there some ancestor/descendant check + log I can add to the reuse code? Walking the entire subtree for every reuse looking for a node rendered into a temp buffer seems pretty intense. Or maybe I can detect the infinite loop, log the parent chain and break out in such a way that the code doesn't freeze elsewhere? I've tried breaking out of the insertNode loop before and that wasn't enough; it just froze later. So, I somehow need to log a bunch of stuff and then clear the buffer, null out a parent or something.

Finally, it might be worth looking into how document destruction impacts things. I don't think NVDA uses an event for document destruction; it relies on checking the state of the document when something happens (focus?) to make it check that. If the focus (or whatever) event got delayed, I guess that could mean a stale document is hanging around with a bunch of ids that might get reused in different documents and subsequently fire events.

This is still very rare and there still isn't a clear path forward.

Whiteboard: [ctw-m4] → [ctw-m5]

(In reply to James Teh [:Jamie] from comment #6)

Finally, it might be worth looking into how document destruction impacts things. I don't think NVDA uses an event for document destruction; it relies on checking the state of the document when something happens (focus?) to make it check that. If the focus (or whatever) event got delayed, I guess that could mean a stale document is hanging around with a bunch of ids that might get reused in different documents and subsequently fire events.

Since this is a simple place to start, I modified NVDA locally to log an error and skip the event if it gets an event for a node in a buffer whose document has died:

diff --git a/nvdaHelper/vbufBackends/gecko_ia2/gecko_ia2.cpp b/nvdaHelper/vbufBackends/gecko_ia2/gecko_ia2.cpp
index 7f7cdf8fa..07062ff69 100755
--- a/nvdaHelper/vbufBackends/gecko_ia2/gecko_ia2.cpp
+++ b/nvdaHelper/vbufBackends/gecko_ia2/gecko_ia2.cpp
@@ -1361,6 +1361,13 @@ void CALLBACK GeckoVBufBackend_t::renderThread_winEventProcHook(HWINEVENTHOOK ho
 		VBufStorage_controlFieldNode_t* node=backend->getControlFieldNodeWithIdentifier(docHandle,ID);
 		if(!node)
 			continue;
+		{
+			CComPtr<IAccessible2> doc = IAccessible2FromIdentifier(docHandle, backend->rootID);
+			if (!doc) {
+				LOG_CRITICAL("jtd Got a node from a dead doc!");
+				continue;
+			}
+		}
 		if (eventID == EVENT_OBJECT_HIDE) {
 			// When an accessible is moved, events are fired as if the accessible were
 			// removed and then inserted. The insertion events are fired as if it were

I saw this log message earlier today after closing a tab, but another document already had focus when the message was logged; there was no freeze or delay. However, this made me realise that the old document might still be alive for a short while after the new one gets focus. In that case, NVDA wouldn't kill the buffer until the next focus event, which could be much later, long after the user started interacting with the new document. This means increasing the id release timer isn't going to fix this.

See Also: → 1817349

I submitted a pull request to NVDA 2 weeks ago to ensure its vbuf doesn't try to render anything if a document is dead. I've been running with that since then and haven't seen this problem. That pull request has now been merged into NVDA alpha builds and should be in the next release.

Status: REOPENED → RESOLVED
Closed: 2 years ago1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.