Closed Bug 1516136 Opened 2 years ago Closed 2 years ago

New localStorage causes hangs when run with accessibility enabled

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P1)

66 Branch
defect

Tracking

()

VERIFIED FIXED
Tracking Status
firefox67 --- verified

People

(Reporter: astrojny, Assigned: janv)

References

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0

Steps to reproduce:

Downloaded latest version of Nightly on 12/22/2018


Actual results:

Nightly freezes.  Sometimes after visiting one website, sometimes after visitng several. 


Expected results:

Nithly freezes.  Have to go to Task Master (contol alt delete) to close FIrefox Nightly
I'm getting this too. At first I thought it was just on Google but it froze just clicking next on an AnandTech comment pager. There is non-trivial CPU activity on many processes, but windows don't refresh, painting is frozen.

Lots of activity on uctrbase.dll!o_exp+0x20 and in xul.dll. Trying to debug didn't work as Firefox killed itself (VS 2015).
 
I noticed one of the recent updates was ~6MB larger than usual (11MB vs 6MB). Maybe related to this new version/component?

AMD Bobcat-based x120e, Windows 10 AMD64 latest edition with all patches, Firefox Nightly, 6GB RAM, 64-bit. Doesn't take long after launch for me.
I lied, apparentl debugging did work, just hella slow on this.

The main thread appears to be stuck in an NtUserMsgWaitForMultipleObjectsEx in AccessibleHandler.dll, called from xul.dll.

Possibly related to bug 1428759 or bug 1508251 / bug 1419108?
Not sure AccessibleHandler is the issue. Attached is a listing of all the threads inside the named (which of course might not be the actual problem...).

When a page is loading the loading animation in the tab will continue (but the rest is frozen, usually with the tooltip indicating a resource loading at the bottom left open as well). 

Running in safe mode (disabling e.g. NoScript) does not appear to stop this from happening. Happens almost immediately on the BBC News website. I have lots of tabs and windows open from restore which might make it hit something more frequently.
The most reliable way to freeze it seems to be to do a Google search and then try to go somewhere else with the tab while it's loading. Happens maybe half the time. Possibly because Google has so much JS running, or maybe it's a race condition.
I got a crash on this system related to "PLDHashTable::Search | nsBaseHashtable<T>::Remove | mozilla::dom::LSDatabase::AllowToClose":
https://crash-stats.mozilla.org/report/index/61d711ad-8912-4707-9449-225e00181223#tab-details

Related crashes started 12/22/2018 @ 12:48am (UTC). So, possibly a related issue? See Bug 1516104 and Bug 1516111.
Next-gen storage was toggled to enabled two days ago:
https://bugzilla.mozilla.org/show_bug.cgi?id=1516023#c1

I changed dom.storage.next_gen to false in about:config and the freezes appear to have stopped.
Laurence, thanks for all the information.

Could you test one more thing ? Do you still experience the hang/crash if you:
change dom.storage.next_gen back to true
change dom.storage.snapshot_reusing to false

Thanks.
The browser no longer seems to freeze for more than a half-second or so occasionally, on a workload including numerous Google searches while other things are going on, so I'm guessing you fingered the issue.

In case it matters, this is a very old continuous session. Places is 82MB, faicons is 70MB, webappsstore is 61MB, session store is 800KB over four windows and hundreds of tabs, etc. Yeah, I know I should cut down. :-)
OK, thanks.

I need to investigate more, but the information you collected plus the crash reports help a lot.
Unfortunately while the setting in question potentially decreases the incidence of such freezes, I *seem* to have run into one again just now, while trying to open a bunch of results in tabs while Google was finishing processing a search result page. As before, the tab animation was sweeping left and right, and CPU processing was occurring, but everything appeared frozen.

Sorry if that increases the search area again, but at least you know.
I'm being bitten by this. I believe this is due to an interaction between the next gen storage stuff and a11y. I know nothing about the next gen storage code, but according to bug 1513859 and my own debugging, next gen storage sometimes needs to make sync IPC calls to the parent. The problem is that the parent sometimes needs to make sync COM calls to the content process for a11y. It goes something like this:

1. A new content document loads.
2. An a11y event gets fired for the document.
3. An a11y client asks for an accessible in the document.
4. The parent process makes a sync COM call to get the accessible.
5. A COM MTA thread in the content process receives the call and queues a sync query to the content main thread.
6. Meanwhile, before the content main thread can handle 5), next gen storage makes a sync IPC call to the parent in the content main thread.
7. Deadlock: steps 4, 5 and 6.

Debugging:

I didn't save a stack for the parent process, but I saw it was stuck in a sync COM call to IAccessible::get_accChild on a content document accessible.

Content COM MTA thread:
0:040> ~38 kp
 # Child-SP          RetAddr           Call Site
# Waiting on the main thread
00 000000a5`d0a7d4e8 00007fff`ed48c01e ntdll!NtWaitForMultipleObjects+0x14
01 000000a5`d0a7d4f0 00007fff`ed48bf0e KERNELBASE!WaitForMultipleObjectsEx+0xfe
02 000000a5`d0a7d7f0 00007fff`99eeeee6 KERNELBASE!WaitForMultipleObjects+0xe
03 000000a5`d0a7d830 00007fff`99ef03e6 xul!mozilla::mscom::SpinEvent::Wait(void * aTargetThread = <Value unavailable error>)+0x76 [z:\build\build\src\ipc\mscom\SpinEvent.cpp @ 67] 
04 000000a5`d0a7d8e0 00007fff`99ef233f xul!mozilla::mscom::MainThreadInvoker::Invoke(struct already_AddRefed<nsIRunnable> * aRunnable = <Value unavailable error>)+0xe6 [z:\build\build\src\ipc\mscom\MainThreadInvoker.cpp @ 161] 
05 000000a5`d0a7d940 00007fff`ef4a7234 xul!mozilla::mscom::MainThreadHandoff::OnCall(struct ICallFrame * aFrame = 0x000001c4`87b66580)+0x12f [z:\build\build\src\ipc\mscom\MainThreadHandoff.cpp @ 350] 
06 000000a5`d0a7da50 00007fff`ef4d3296 ole32!Interceptor::CallIndirect(HRESULT * phReturnValue = 0x000000a5`d0a7dc30, unsigned long iMethod = <Value unavailable error>, void * pvArgs = 0x000000a5`d0a7dc40, unsigned long * pcbArgs = 0x000000a5`d0a7dc28)+0x104 [com\ole32\com\txf\callframe\interceptor.cpp @ 1370] 
07 000000a5`d0a7dab0 00007fff`ef4d3296 ole32!Interceptor::CallIndirect+0x2c166 [com\ole32\com\txf\callframe\interceptor.cpp @ 1334] 
08 000000a5`d0a7db10 00007fff`ef4d3296 ole32!Interceptor::CallIndirect+0x2c166 [com\ole32\com\txf\callframe\interceptor.cpp @ 1334] 
09 000000a5`d0a7db70 00007fff`ef4d042a ole32!Interceptor::CallIndirect+0x2c166 [com\ole32\com\txf\callframe\interceptor.cpp @ 1334] 
0a 000000a5`d0a7dbd0 00007fff`ef6777e3 ole32!COMInterceptor(void)+0x6a [com\ole32\com\txf\callframe\amd64\stubless.asm @ 198] 
0b 000000a5`d0a7dc40 00007fff`ef6db4a6 rpcrt4!Invoke+0x73
# More COM stuff...

Content main thread:
0:040> ~0 kp
 # Child-SP          RetAddr           Call Site
# Waiting on parent process
00 000000a5`cd7eda58 00007fff`f12afa89 ntdll!NtWaitForAlertByThreadId+0x14
01 000000a5`cd7eda60 00007fff`ed4603fd ntdll!RtlSleepConditionVariableSRW+0x109
02 000000a5`cd7edad0 00007fff`d910f0a3 KERNELBASE!SleepConditionVariableSRW+0x2d
03 000000a5`cd7edb10 00007fff`9982be16 mozglue!mozilla::detail::ConditionVariableImpl::wait(class mozilla::detail::MutexImpl * lock = <Value unavailable error>)+0x13 [z:\build\build\src\mozglue\misc\ConditionVariable_windows.cpp @ 50] 
04 000000a5`cd7edb40 00007fff`97fba4db xul!mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::LabeledEventQueue> >::GetEvent(bool aMayWait = <Value unavailable error>, mozilla::EventPriority * aPriority = 0x000000a5`cd7edc4c)+0x196 [z:\build\build\src\xpcom\threads\ThreadEventQueue.cpp @ 148] 
05 000000a5`cd7edbd0 00007fff`97fba1b9 xul!nsThread::ProcessNextEvent(bool aMayWait = true, bool * aResult = 0x000000a5`cd7ee17f)+0x20b [z:\build\build\src\xpcom\threads\nsThread.cpp @ 1081] 
06 000000a5`cd7ee150 00007fff`9b7167a6 xul!NS_ProcessNextEvent(class nsIThread * aThread = <Value unavailable error>, bool aMayWait = true)+0x29 [z:\build\build\src\xpcom\threads\nsThreadUtils.cpp @ 468] 
07 000000a5`cd7ee1a0 00007fff`9b7056ff xul!mozilla::SpinEventLoopUntil<mozilla::ProcessFailureBehavior::ReportToCaller,`lambda at z:/build/build/src/dom/localstorage/LSObject.cpp:1017:7'>(class mozilla::dom::`anonymous namespace'::RequestHelper::StartAndReturnResponse::<unnamed-tag> * aPredicate = 0x000000a5`cd7ee210)+0x56 [z:\build\build\src\obj-firefox\dist\include\nsThreadUtils.h @ 335] 
08 000000a5`cd7ee1f0 00007fff`9b7053f7 xul!mozilla::dom::`anonymous namespace'::RequestHelper::StartAndReturnResponse(class mozilla::dom::LSRequestResponse * aResponse = 0x000000a5`cd7ee340)+0x19f [z:\build\build\src\dom\localstorage\LSObject.cpp @ 1031] 
# Synchronous call begins here
09 000000a5`cd7ee260 00007fff`9b7049d0 xul!mozilla::dom::LSObject::DoRequestSynchronously(class mozilla::dom::LSRequestParams * aParams = 0x000000a5`cd7ee2e0, class mozilla::dom::LSRequestResponse * aResponse = 0x000000a5`cd7ee340)+0x57 [z:\build\build\src\dom\localstorage\LSObject.cpp @ 696] 
0a 000000a5`cd7ee2b0 00007fff`9b704dc1 xul!mozilla::dom::LSObject::EnsureDatabase(void)+0x110 [z:\build\build\src\dom\localstorage\LSObject.cpp @ 742] 
0b 000000a5`cd7ee3f0 00007fff`9a9d2de1 xul!mozilla::dom::LSObject::SetItem(class nsTSubstring<char16_t> * aKey = 0x000000a5`cd7ee550, class nsTSubstring<char16_t> * aValue = 0x000000a5`cd7ee4c0, class nsIPrincipal * aSubjectPrincipal = <Value unavailable error>, class mozilla::ErrorResult * aError = 0x000000a5`cd7ee4a8)+0x41 [z:\build\build\src\dom\localstorage\LSObject.cpp @ 511] 
0c 000000a5`cd7ee480 00007fff`9803ec3d xul!mozilla::dom::Storage_Binding::setItem(struct JSContext * cx = <Value unavailable error>, class JS::Handle<JSObject *> obj = <Value unavailable error>, class mozilla::dom::Storage * self = 0x000001c4`860ffc10, class JSJitMethodCallArgs * args = 0x000000a5`cd7ee658)+0x2e1 [z:\build\build\src\obj-firefox\dom\bindings\StorageBinding.cpp @ 218] 
0d 000000a5`cd7ee630 00007fff`988c607e xul!mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy,mozilla::dom::binding_detail::ThrowExceptions>(struct JSContext * cx = 0x000001c4`fb725000, unsigned int argc = 2, union JS::Value * vp = <Value unavailable error>)+0xfd [z:\build\build\src\dom\bindings\BindingUtils.cpp @ 3062] 
0e 000000a5`cd7ee6f0 00007fff`988c6ceb xul!js::InternalCallOrConstruct(struct JSContext * cx = 0x000001c4`fb725000, class JS::CallArgs * args = <Value unavailable error>, js::MaybeConstruct construct = <Value unavailable error>)+0x27e [z:\build\build\src\js\src\vm\Interpreter.cpp @ 535] 
0f 000000a5`cd7ee7f0 00007fff`988bafc8 xul!InternalCall(struct JSContext * cx = 0x000001c4`fb725000, class js::AnyInvokeArgs * args = 0x000000a5`cd7eeb30)+0xbb [z:\build\build\src\js\src\vm\Interpreter.cpp @ 590] 
10 000000a5`cd7ee850 00007fff`988b8828 xul!Interpret(struct JSContext * cx = <Value unavailable error>, class js::RunState * state = 0x000000a5`cd7eed78)+0x22a8 [z:\build\build\src\js\src\vm\Interpreter.cpp @ 3320] 
11 000000a5`cd7eec30 00007fff`988c62ac xul!js::RunScript(struct JSContext * cx = 0x000001c4`fb725000, class js::RunState * state = 0x000000a5`cd7eed78)+0x198 [z:\build\build\src\js\src\vm\Interpreter.cpp @ 423] 
12 000000a5`cd7eed40 00007fff`988c6ceb xul!js::InternalCallOrConstruct(struct JSContext * cx = 0x000001c4`fb725000, class JS::CallArgs * args = <Value unavailable error>, js::MaybeConstruct construct = <Value unavailable error>)+0x4ac [z:\build\build\src\js\src\vm\Interpreter.cpp @ 563] 
13 000000a5`cd7eee40 00007fff`988c6d34 xul!InternalCall(struct JSContext * cx = 0x000001c4`fb725000, class js::AnyInvokeArgs * args = 0x000000a5`cd7eef18)+0xbb [z:\build\build\src\js\src\vm\Interpreter.cpp @ 590] 
14 000000a5`cd7eeea0 00007fff`98a01ed3 xul!js::Call(struct JSContext * cx = <Value unavailable error>, class JS::Handle<JS::Value> fval = <Value unavailable error>, class JS::Handle<JS::Value> thisv = <Value unavailable error>, class js::AnyInvokeArgs * args = 0x000000a5`cd7eef18, class JS::MutableHandle<JS::Value> rval = <Value unavailable error>)+0x24 [z:\build\build\src\js\src\vm\Interpreter.cpp @ 606] 
15 000000a5`cd7eeed0 000001a9`b473ec4d xul!js::fun_call(struct JSContext * cx = 0x000001c4`fb725000, unsigned int argc = <Value unavailable error>, union JS::Value * vp = <Value unavailable error>)+0x1a3 [z:\build\build\src\js\src\vm\JSFunction.cpp @ 1192] 
# ...
I guess we need to somehow teach next gen local storage to detect mozilla::mscom::MainThreadInvoker (or teach MainThreadInvoker how to notify local storage), but I'm not really sure how, especially because it's in a background thread. Is LSObject::CancelSyncLoop thread-safe?
(In reply to James Teh [:Jamie] (away from bugmail until 7 Jan) from comment #12)

James, thanks for your findings.

> I guess we need to somehow teach next gen local storage to detect
> mozilla::mscom::MainThreadInvoker (or teach MainThreadInvoker how to notify
> local storage), but I'm not really sure how, especially because it's in a
> background thread. Is LSObject::CancelSyncLoop thread-safe?

LSObject::CancelSyncLoop doesn't exist anymore, there's
LSObject::OnSyncMessageReceived and LSObject::OnSyncMessageHandled instead.

Anyway, the way we handle unsafe CPOWs is that we just cancel localStorage
sync request to the parent, so JS gets a failure. We can do this because
unsafe CPOWs can't be used in normal content code, only in some mochitests.
I guess, we can't just cancel it for the a11y case.

I need to better understand the a11y flow, for now I'm thinking about
calling LSObject::GetSyncLoopEventTarget from a11y code somehow, so that
it would use this special event target instead of main queue which is not
processed at that time.
Also, the mozilla::mscom::MainThreadInvoker is used on windows obviously.
What about other platforms, do they use sync requests too ?
Assignee: nobody → jvarga
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
(In reply to Jan Varga [:janv] from comment #14)
> Also, the mozilla::mscom::MainThreadInvoker is used on windows obviously.
> What about other platforms, do they use sync requests too ?

Other platforms use a very different code path. The parent process can make sync IPC calls to the content process in order to answer queries from a11y clients. For example:
1. Client asks for Name, which makes a sync IPC call from parent process: https://searchfox.org/mozilla-central/source/accessible/atk/AccessibleWrap.cpp#519
2. IPDL declaration: https://searchfox.org/mozilla-central/source/accessible/ipc/other/PDocAccessible.ipdl#122
3. Content process implementation: https://searchfox.org/mozilla-central/source/accessible/ipc/other/DocAccessibleChild.cpp#103
Ok, I thought that only CPOWs can do this kind of sync requests.
Tests and QA didn't catch this problem.

For now, we're going to flip the dom.storage.next_gen pref back to false.

Thanks all for help with investigating this issue.
Duplicate of this bug: 1516270
Hi Jan,

Is there any specific and common condition related to hardware, config or anything else that could cause the crashes? I could only reproduce a one time crash on the Nightly from 66.0a1 (2018-12-23) with the same signature but nothing more.

Interestingly I didn't encounter any crashes on Monday after the pref was switched, nor did any of my colleagues who were testing on Nightly. Given the high amount of crashes, specifically in the first 1 minute of using Nightly (right after updating it?) could this be more related to the Update scenario?
Flags: needinfo?(jvarga)
Timea,
Regarding the hang, it's definitely connected with the accessibility as pointed out by James, I'm currently reworking some stuff in LSNG to fix that.
I think we should add accessibility to the list for QA testing.

However, I still don't know what caused all the crashes. I investigated them, they all seem to be the same issue and maybe they happen more on slower hardware. Personally I never experienced the crash and I use LSNG daily with official nightly builds.
I'll work on the crash issue more, once I fix the hang.
Flags: needinfo?(jvarga)
Adding Accessibility to the test plan and will focus on it. Thanks!
Screen Reader (Narrator on Windows) and NextGen enabled will cause Nightly to hang until its closed and will crash in the end. (crash signature: https://crash-stats.mozilla.org/report/index/a29e51f1-fc1e-4d19-8329-55dce0181228)

STR:
1.  Enable screen reader 
2.  Launch Nightly with NextGen storage enabled
3.  Open several tabs or switch between existing ones
4.  Nightly will freeze/hang
For what it's worth, I hit this often on a top of the line Surface Book 2 so I don't think it's restricted to lower-end hardware. It would typically happen after 30 mins of browsing and then within the first minute every time I restarted after that.
Duplicate of this bug: 1515967
My guess is that the crashes related to this all actually begin as hangs.

Note that we don't officially support Narrator as a screen reader. While hanging or crashing is obviously undesirable with any configuration, I'd suggest you test with a supported screen reader (NVDA or JAWS) so as not to conflate this with other possible problems with unsupported configurations.
Component: Untriaged → DOM: Web Storage
Priority: -- → P1
Product: Firefox → Core
Summary: Nightly build downloaded 12/22/18 constantly freezes. Not a problem prior to this download → New localStorage causes hangs when run with accessibility enabled
Firefox will hang with NVDA and JAWS screen readers as well. Tested on Windows  10 x64 with NextGen enabled via pref.

Per Overholt's request, I tested a try build with latest patches and LSNG enabled. I briefly tested visiting various sites in multiple tabs and wasn't able to reproduce this, so I think this is resolved; the deadlock showed up pretty quickly last time. Thanks!

I again see frequent hangs since the new local storage got reenabled for Nightly. Primarily on mobile.twitter.com when switching to it and the tab reloads (e. g. after an update), or it refreshes tweets in the background. Since I have multiple instances in multiple containers open, this can happen quite frequently. I can manage to shut the Firefox app down via the taskbar's jump list menu (on Windows), but it is still not a pleasant experience having to do this several times a day. And yes, Jamie, this is different from the total freeze we sometimes experience in the UIA process that goes havoc. This is only a hang in Firefox, the rest of the system stays responsive.

Marco, I'm sorry to hear that, can you just disable new localStorage implementation (LSNG) by setting the pref dom.storage.next_gen to false ?
It seems we have a fix for this hang, QA confirmed it fixes the problem, but we don't have reviews for it yet.
Just in case you are interested, here's an experimental patch that I'm working:
https://hg.mozilla.org/try/rev/604ae23f04f048db7ce8ee03a5cbb3ffcb675bc7

Jan, thanks for the info! I am currently observing a pattern which might be of interest. I just had a hang while the container tab for Facebook was reloading after an update of Nightly and I switched to it first time during that session. That one even was so bad that the jump list didn't close Firefox, and I had to reboot the machine to actually get it to close. After that, the FB tab and so far all other tabs loaded normally, even with next gen enabled. I don't know if this is related to the transition from old file storage generation to new, perhaps, or not, but thought I'd mention it just in case this is relevant and helps track down the problem further. But your patch certainly looks plausible, too, e. g. some kind of threading race condition... Do you have the bug number for that patch?

Marco, the scenario which you described makes sense. It may happen that first localStorage access after enabling next gen takes more time (by synchronously blocking main thread in a content process) which can maximize the probability of the deadlock.
I just filed bug 1534208 for this.

Hey Marco, I am no longer getting hangs/crashes on the latest Nightly using NVDA. Bug 1534208 seems to fix that in my case at least. Note that I am using the Beta NVDA version screen reader as the latest one has an already know issue with hangs/crashes (Bug 1532610).
Could you test a bit your scenario when you have the time for it? Please let me know how things turned out. Don't forget to re-enable LSNG if you already disabled it and have the latest Nightly 67.0a1 (2019-03-12) (64-bit) (Build ID: 20190312095443) installed. Thanks!

Flags: needinfo?(mzehe)

Good news, I can also no longer reproduce the hangs or crashes on the two Windows machines I was seeing the hangs in previous builds. Everything runs very smoothly also when reloading tabs or restarting Nightly and loading the tabs again bit by bit.

Flags: needinfo?(mzehe)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

Based on Comment 32 and Comment 33 will close this as Verified - Fixed. If anyone can still reproduce this issue on the latest Nightly and LSNG enabled please re-open the issue with detailed information.

Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.