Closed Bug 1428169 Opened 6 years ago Closed 5 years ago

NFS-hosted profiles have issues with localStorage

Categories

(Core :: DOM: Core & HTML, defect, P2)

56 Branch
x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- fixed

People

(Reporter: terry, Assigned: mak)

References

(Blocks 1 open bug)

Details

(Keywords: hang, regression)

Attachments

(1 file)

Attached file bug1.txt
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36

Steps to reproduce:

1. Run firefox
2. Do a google search for something.
3. Open multiple links using the middle mouse button in separate tabs


Actual results:

Quite often the tabs' content will freeze and will not update for a minute or two. All tabs are affected.
You can still operate Firefox menus and type things in; it's just the tab content updates that are affected.
The Linux system is still running fine with no CPU usage and plenty of memory.
Problem occurs on multiple Fedora 27 systems using stock Fedora Firefox RPM.
It mainly occurs on systems which have an NFS-mounted home directory, but I have seen it occur on a system with the home directory on a local disk.
Attached is a gdb backtrace of some of the threads when the hang occurs.
I have tried moving ~/.cache/mozilla to a local disk. This had no effect.
Severity: normal → critical
Keywords: hang
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Other info:
This has been tested with a newly created user login setup using default Fedora27 Firefox settings as well as for existing users.
There are no Add-ons in use.
I have tried setting the "Prevent accessibility services from accessing your browser" option as mentioned on the web. This did not cure the problem. (Maybe this is a MS Windows thing ?)
I tried setting browser.tabs.remote.autostart.2 to false to disable multi-threaded page update (I think that is what it does). This did not fix the issue.

The bug seems like a thread lock/semaphore timing-related issue. It is intermittent: sometimes it happens almost immediately after starting the browser, and sometimes the browser can run for a while before the issue appears. Maybe slower access to the NFS-mounted /home (although over Gigabit LAN) affects timings in a way that makes this issue happen more frequently?
I can't reproduce this on Ubuntu and I haven't noticed anything on Fedora on prior use. But I don't have anything that is not mounted locally. 

Terry, on a hunch, could you try setting temporarily security.sandbox.content.level to 2 (from about:config) and see if the problem persists?
Also, may I ask if you have upgraded from 56 to 57? If you didn't, could you perhaps temporarily downgrade to 56 (http://ftp.mozilla.org/pub/firefox/releases/56.0.2/) and check if the problem is reproducible with that version for you?
Flags: needinfo?(terry)
Thanks for the reply and for looking at this.
I will try the sandbox setting tonight.
All Firefox installs are from the standard Fedora RPM repository. I will test the Firefox 56 tar install on a separate system.

In this case I did not upgrade Firefox from 56 to 57. All the systems with the problem were a virgin F27 install from scratch with latest updates and these had Firefox 57 by default.

However, the problem started happening on the previous Fedora25 systems when the Firefox was updated to 57 from the previous version (56 ?) in the Fedora standard repositories.

I am using Firefox 57 on a system with locally mounted /home and it doesn't seem to freeze, although I have seen it happen a few times on other systems with a local hard disk. With the NFS-mounted /home it happens a lot.
Flags: needinfo?(terry)
A colleague here has tried setting the security.sandbox.content.level to 2 (it was set to 3). This had no effect on the problem on his system.
I just tried running the Mozilla tar version of Firefox 56 on my system (from the link above, untarred into my home directory and ran ./firefox from there). This runs fine with no freezes on my setup with NFS-mounted /home.
NFS mount information:
king.kingnet:/home on /home type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.202.2,local_lock=none,addr=192.168.202.1)
Just in case I tried the Mozilla tar version of Firefox 57.0.4 on my system. This has the same problems as the Fedora RPM version.
A very brief try of 58.0b15 seemed to be better, but its tab contents still froze after a while of use (perhaps for a shorter time).
Since it seems Firefox 56 works fine on your setup, we might be looking at a regression. Would you be willing to run mozregression in order to see if we can find the issue that introduced it?

mozregression install: http://mozilla.github.io/mozregression/install.html

After installing mozregression, it will require parameters to identify good and bad build (applies to Nightly). Given comment 5, I would say this is the right command to go with: 
mozregression  --good 2017-08-02 --bad 2017-11-10
Flags: needinfo?(terry)
We are trying this but are having difficulties seeing the problem with these (It is a bit intermittent). Do you know if mozregression will run the firefox downloads with the profile/cache and all other files in the users home directory or will it use /tmp or somewhere else ?
Flags: needinfo?(terry)
Mozregression by default uses a tmp profile. It is possible though to run mozregression with profile params. If you are using the gui version, then in the Bisection wizard, in the profile window, you have a profile path to browse + profile persistence. To have mozregression using the network profile, you would want to select the value "reuse" for profile persistence and then browse for your network profile.

For the command-line mozregression, the command would be:
mozregression --profile /path/to/profile --profile-persistence reuse


Note:
To find the location for the profile you want to use (in your case the network location), from Firefox open about:support, locate the Profile directory section and click on the 'Open Directory'. 
You can check with the above again on the first build if mozregression really uses the selected profile and doesn't get you a tmp one.
Hi, two of us have tried this but are not getting anywhere. Mozregression needs old shared libraries that are not present on F27. Although we can hack around that, it is starting to get a bit dodgy for testing: the tests seem to use Python pip to install things rather than rpm, and as we have no real idea what mozregression is actually doing or where the profile and cache directories it uses are, it is getting very complex. The cache and profile directories probably have to be on NFS to debug this.

So we thought we would just download and run the nightlies manually from: http://ftp.mozilla.org/pub/firefox/nightly/2017/.
However these don't make sense. We have tried the *-mozilla-central/ ones but the version seems to jump from 57.0a1 on the 2017-09-20 to 58.0a1 on the 2017-09-23 without any interim versions being present. So we can't see how to try any of the interim versions.
Some more info. Moving the Firefox profile from NFS to a local hard disk fixes the issue. So there is some issue with accessing the profile when it is on an NFS disk (file locks?).

Any ideas on how best to debug this ?
I can build firefox from source and add a few printf's to see where it is locking, but have no idea where to look in the code.
Some other notes. We are using NFSv4 with default mount options.
I note on the NFS server (where /home is mounted from) that /proc/fs/nfsd/nfsv4leasetime = 90. I wonder if a Firefox sqlite3 database access is not being unlocked properly (thread stops without releasing locks?) and Firefox only recovers when the NFSv4 lock timeout expires. 90s could be the sort of period for these lockups.
One item that may or may not be relevant, when the lockup occurs it appears the following message is output by firefox:
(/usr/lib64/firefox/firefox:2239): dconf-WARNING **: Unable to open /var/lib/flatpak/exports/share/dconf/profile/user: Permission denied
Other bits of info:

1. I tried setting "Content process limit" to 3 in the Performance preferences (by default it was 1). When I saw the problem happen only one of the TABS was locked others were operating and updating ok. So the fault seems to lie within an individual content processing task.

2. While typing this firefox locked a few times for about 10-20 secs.

3. I tried an NFSv3 mount with the option local_lock=all. This had no effect on the issue.

4. Once, while quitting Firefox during a freeze, I saw the error message:
[Parent 5877, Main Thread] ###!!! ABORT: file resource://gre/modules/Sqlite.jsm, line 374
[Parent 5877, Main Thread] ###!!! ABORT: file resource://gre/modules/Sqlite.jsm, line 374
ExceptionHandler::GenerateDump cloned child 6127
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...

5. Sometimes Firefox has frozen just after starting up with the GUI frozen such that bookmarks in the Bookmarks Toolbar cannot be accessed.

6. I often see lots of messages like:
###!!! [Parent][MessageChannel] Error: (msgtype=0x150083,name=PBrowser::Msg_Destroy) Closed channel: cannot send/recv
###!!! [Child][MessageChannel] Error: (msgtype=0x2400FA,name=PContent::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
###!!! [Child][MessageChannel] Error: (msgtype=0x2400FA,name=PContent::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
###!!! [Child][MessageChannel] Error: (msgtype=0x2400FA,name=PContent::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
###!!! [Child][MessageChannel] Error: (msgtype=0x2400FA,name=PContent::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
###!!! [Child][MessageChannel] Error: (msgtype=0x4B0026,name=PNecko::Msg_RequestContextAfterDOMContentLoaded) Channel closing: too late to send/recv, messages will be lost

These occur whether or not a freeze occurs, so they are probably not relevant.

7. From other reported Firefox bugs, it appears that there are multiple related issues on MS Windows and Linux where Firefox freezes. I suspect these are all due to the same bug manifesting itself in different ways on the different platforms and/or timings of individual systems.

8. Users here are now using Google Chrome as Firefox is not usable with this bug when NFS /home file systems are used.
Component: Untriaged → Networking
Product: Firefox → Core
Thanks a lot Terry for the in-depth details.

Summing up what we know so far:

1. This is a regression from FF 56 - marking as such (comment 5)
2. The issue occurs when running FF with the profile hosted on an NFS disk drive; it works fine with a local profile.
3. There are several reported issues of freezes across other OSes, but I'm not sure at this point if they have the same root problem or are entirely different bugs.

I moved this issue to Networking since IMO it is the most likely place to get a better idea of how to debug this issue next. Patrick, any ideas of what's happening here and how to move forward with debugging this?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(mcmanus)
From my experience (multi-process, parallel processing/real-time systems etc.) and the info that the system freezes and recovers, and the fact that Firefox has only recently moved to a multi-threaded/multi-process model for content display I would put my money on race hazards, timing, wait loops with the various semaphores/mutexes that will be used. I suspect something locking/awaiting something which won't happen due to the state the other tasks have got themselves into (two tasks waiting on one another for example) until some timeout occurs. This can also account for the other similar reported bugs.
Having a printf where such a timeout occurs would probably be a good start in finding out what is happening.
Is there someone who deals with the multi-tasking/synchronisation side of Firefox ?
I would start by looking at bug 719952. I know there are a number of issues here. None of it is really covered by the networking code; unfortunately it's just a file abstraction that sometimes plays poorly (with sqlite locking, if I recall correctly).
Flags: needinfo?(mcmanus)
I can reproduce it easily when the profile is on NFS. Here is data from the Gecko profiler: https://perfht.ml/2DDmW4J. I don't know this code, so I'm not able to investigate it further.
Component: Networking → DOM
Flags: needinfo?(jvarga)
this is the typical pay off of the sync localStorage API :(

- possibly remove the SyncPreload thing, not sure what effect on performance it has anyway (telemetry is no longer collected, if it was even there for it)
- let the lock wait in WaitForPreload() reasonably timeout (use monitor instead) and reset the usage to be session only on timeout (leading to nondeterministic data loss...) ; block time in GetItem is <14ms for 95% users, waiting for 500ms should cover 99.8% cases regarding LOCALDOMSTORAGE_GETVALUE_BLOCKING_MS
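
A minimal sketch of the timeout idea in the second bullet, written in Python rather than Gecko's C++. `PreloadGate`, `finish_preload`, `get_item` and `session_only` are hypothetical names used for illustration and are not the DOMStorage code; only the 500 ms default comes from the comment above.

```python
import threading


class PreloadGate:
    """Toy stand-in for a storage cache filled by a background preload."""

    def __init__(self):
        self._cond = threading.Condition()  # plays the role of the monitor
        self._loaded = False
        self._data = {}
        self.session_only = False  # hypothetical flag set after a timeout

    def finish_preload(self, data):
        # Called by the background loader once the stored data is available.
        with self._cond:
            if not self.session_only:
                # After a timeout the persisted data is ignored on purpose;
                # this is the data-loss trade-off mentioned in the comment.
                self._data.update(data)
            self._loaded = True
            self._cond.notify_all()

    def get_item(self, key, timeout_s=0.5):
        # Wait for the preload, but give up after timeout_s instead of
        # blocking the caller indefinitely.
        with self._cond:
            if not self._cond.wait_for(lambda: self._loaded, timeout=timeout_s):
                self.session_only = True
            return self._data.get(key)
```

With this shape a slow or stuck preload costs the caller at most `timeout_s`, at the price of treating the storage as session-only from then on.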
Yeah, we can do the timeout thing.
Flags: needinfo?(jvarga)
Priority: -- → P2
Any news on this, can I try out a proposed fix or can someone point me to the place in the code that will need a fix ?
(In reply to Terry Barnaby from comment #22)
> Any news on this, can I try out a proposed fix or can someone point me to
> the place in the code that will need a fix ?

I can see that StorageDBChild::RecvLoadDone is still received on the main thread, despite that the IPC db code is PBackground'ed.  I no longer maintain DOMStorage so I don't know in what state DOMStorage currently is.

Anyway, based on that, please disregard comment 20.  We can't just remove the SyncPreload thing - see [1], the last paragraph of that chapter.  The possibility of the lock->monitor/timeout change goes away along with it.

Both could be done, however, when we move the IPC child recv part actually to a background thread.  

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/DOM_Storage_implementation_notes#IPC
Summary: Firefox 57, Fedora Linux, freeze on opening tabs → Firefox 57, Fedora Linux, freeze on opening tabs (localStorage.getItem blocks)
See Also: → localStorageIO
(In reply to Honza Bambas (:mayhemer) from comment #23)
> I can see that StorageDBChild::RecvLoadDone is still received on the main
> thread, despite that the IPC db code is PBackground'ed.  I no longer
> maintain DOMStorage so I don't know in what state DOMStorage currently is.

Here's the status of new LS:
https://bugzilla.mozilla.org/show_bug.cgi?id=1286798#c25

The new implementation is using a non main thread in the content process to do the "preload" and the main thread is blocked by creating a nested event target and spinning the event loop, so there will be more options to enhance it by a timeout etc.
Thanks for the info and for the work on this.
So it looks like Firefox in mainstream releases won't be usable with profiles in NFS mounted home directories until version 61, which should be available near July 3, 2018 or later.
Terry, you may want to try setting the boolean "storage.nfs_filesystem" preference to true in the profiles you are using.  The preference is not in all.js and so will not show up in about:config unless you manually create it.

This causes our SQLite VFS implementation to use the "unix-excl" VFS instead of "unix" which might be happier.  See bug 433129 and section 3.1 of https://www.sqlite.org/vfs.html.
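
For anyone who wants to experiment with the "unix-excl" VFS outside Firefox, here is a small sketch using Python's standard sqlite3 module. It assumes a Unix build of SQLite (where that VFS is registered by default); `open_with_vfs` and `demo` are illustrative names, and this is not how Firefox itself selects the VFS.

```python
import os
import sqlite3
import tempfile
from urllib.parse import quote


def open_with_vfs(path, vfs="unix-excl"):
    # SQLite URI filenames accept a "vfs" query parameter that names the
    # VFS to use for this connection.
    uri = "file:" + quote(path) + "?vfs=" + vfs
    return sqlite3.connect(uri, uri=True)


def demo():
    # Create a throwaway database, write one row and read it back, all
    # through a single connection opened with the "unix-excl" VFS.
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_with_vfs(os.path.join(tmp, "demo.sqlite"))
        try:
            conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
            conn.execute("INSERT INTO kv VALUES ('greeting', 'hello')")
            conn.commit()
            row = conn.execute("SELECT v FROM kv WHERE k = 'greeting'").fetchone()
            return row[0]
        finally:
            conn.close()
```

Pointing `open_with_vfs` at a scratch file on the NFS mount and comparing against `vfs="unix"` may help separate SQLite locking behaviour from anything Firefox-specific.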

If you can confirm if that improves things for you, that would be useful information.  It may be time to re-visit our linux locking approach, although I do have to warn you that due to recent shifts in engineers available, unless a helpful person like you can focus on it, it might take some time to find someone with spare cycles to improve this scenario.
Flags: needinfo?(terry)
I have tried adding and setting storage.nfs_filesystem to true.
This definitely improves things, it may fix the issue. My simple test scenario I am using runs fine with this setting showing no lockups with the latest Fedora27 Firefox 58.0.1. If this setting is not present the locks happen all of the time.

I will get some of the users here back on Firefox to try it out in a normal usage. Yes, it is always tricky to find people with the time and knowledge to do these things.
Thanks for the idea, this should allow us to use Firefox :)
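
For rolling the pref out to several users' profiles, one option is a user.js file in each profile directory, which Firefox reads at startup. The helper below is a sketch only: `ensure_nfs_pref` is a made-up name, and the profile path has to be supplied by whoever runs it.

```python
from pathlib import Path

# The pref line as it would appear in a profile's user.js file.
PREF_LINE = 'user_pref("storage.nfs_filesystem", true);'


def ensure_nfs_pref(profile_dir):
    """Append the pref to <profile_dir>/user.js unless it is already there.

    Returns True if the file was changed, False if the line was present.
    """
    user_js = Path(profile_dir) / "user.js"
    existing = user_js.read_text(encoding="utf-8") if user_js.exists() else ""
    if PREF_LINE in existing:
        return False
    # Keep any existing content and make sure the new line starts cleanly.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    user_js.write_text(existing + separator + PREF_LINE + "\n", encoding="utf-8")
    return True
```

Calling it twice on the same profile leaves a single copy of the line, so it can be run from a login script.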
Flags: needinfo?(terry)
Summary: Firefox 57, Fedora Linux, freeze on opening tabs (localStorage.getItem blocks) → NFS-hosted profiles have issues with localStorage
We had been searching for weeks for why we get freezes after downloading PDFs or opening new tabs.
storage.nfs_filesystem true fixed it with our NFS /homes

Thank you for this pointer.
See also bug 1432484 about freezing with user profiles on NFS. I'm wondering whether the two workarounds are equivalent or complementary:
* setting the environment variable NSS_SDB_USE_CACHE,
* setting the boolean "storage.nfs_filesystem" preference.
NFS client info in duped https://bugzilla.mozilla.org/show_bug.cgi?id=1448844#c14 that we should factor in.

If anyone on this bug would also like to chime in with info like the below, it would be helpful and appreciated:
- The NFS server program and version in use.
- The client program and version in use.
- Any special server settings, client settings, or mount settings.  For example, mount supports "nolock" which sounds appropriately terrifying.
- Does lockd or statd seem broken on the server?  For example, http://sophiedogg.com/lockd-and-statd-nfs-errors/ calls out Firefox experiencing problems after the server's state files seemed to get corrupted.
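
To make the mount settings requested above easier to collect, here is a rough Python sketch that pulls the options out of `mount` output in the format Terry pasted earlier in this bug. `parse_mount_line` and `nfs_mounts` are illustrative names, and the parser assumes the usual Linux "<source> on <target> type <fstype> (<options>)" layout.

```python
def parse_mount_line(line):
    """Parse one line of `mount` output into its parts."""
    # Split off the trailing "(opt1,opt2=val,...)" group first.
    head, _, opts = line.strip().rpartition(" (")
    source, _, rest = head.partition(" on ")
    target, _, fstype = rest.rpartition(" type ")
    options = {}
    for item in opts.rstrip(")").split(","):
        key, _, value = item.partition("=")
        # Flags without a value (rw, hard, ...) are stored as True.
        options[key] = value or True
    return {"source": source, "target": target, "fstype": fstype, "options": options}


def nfs_mounts(mount_output):
    """Return the parsed entries for NFS mounts found in `mount` output."""
    return [
        parse_mount_line(line)
        for line in mount_output.splitlines()
        if " type nfs" in line
    ]
```

Fields such as `vers` and `local_lock` in the resulting `options` dictionary are the ones most likely to matter for the locking behaviour discussed here.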
Since setting storage.nfs_filesystem True Firefox works without an issue for > 20 users with NFS /home on Debian 8 server.
Clients are Debian 9.

We don't see any issue with another program.
This will be fixed in Firefox 62 when bug 1472722 lands, changing us to use the alternate locking mode by default.

The existing "storage.nfs_filesystem" pref will be obsoleted and will no longer have an effect once the change lands, with all profiles behaving as if the pref had been set to true in pre-landing builds.  It's safe to leave the obsolete pref around in profiles, and probably advisable for a few releases in case the change ends up getting backed out, etc.

A new "storage.multiProcessAccess.enabled" pref is being added that when set to (bool) true will revert to the original pre-landing behavior equivalent to "storage.nfs_filesystem" not being set/left as its false default.  The only reason one might want to do that would be if trying to interact with SQLite databases for debugging purposes.
Depends on: 1472722
(In reply to Andrew Sutherland [:asuth] from comment #33)
> This will be fixed in Firefox 62 when bug 1472722 lands, changing us to use
> the alternate locking mode by default.

Sorry, I of course mean Firefox 63 which is currently the mozilla-central branch since June 25th, per https://wiki.mozilla.org/Release_Management/Calendar.
Thanks for letting us know and thanks to all who have worked on this :)
Exclusive locking is now enabled by default on Linux
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
(In reply to Andrew Sutherland [:asuth] from comment #33)
> This will be fixed in Firefox 62 when bug 1472722 lands, changing us to use
> the alternate locking mode by default.
> [...] 
> A new "storage.multiProcessAccess.enabled" pref is being added that when set
> to (bool) true will revert to the original pre-landing behavior equivalent
> to "storage.nfs_filesystem" not being set/left as its false default.

I'm currently using 63.0.3 on Ubuntu and don't see it in about:config. Am I missing something? Based on cursory testing, my Firefox still seems to be suffering from the previously reported symptoms (website UI hangs while browser UI remains responsive; inability to save bookmarks etc.)
Component: DOM → DOM: Core & HTML

Hi,
I tried Firefox ESR 68.0.1 with the profile stored on an NFS mount, and the problem persists. Any news?

Regards

(In reply to salleman from comment #38)

> I tried Firefox ESR 68.0.1 with the profile stored on an NFS mount, and the problem persists. Any news?

Please file a new bug at https://bugzilla.mozilla.org/enter_bug.cgi?product=Core&component=DOM%3A%20Web%20Storage and in your comment please:

  • Indicate what happens when you visit https://firefox-storage-test.glitch.me/ with the profile in question. Specifically, do we return "Good" for any/all of the 4 subsystems?
  • Do you get the same result with a fresh profile when visiting the site? You can use about:profiles to create a new profile and launch that profile in a separate Firefox instance from the UI.
  • Anything about the NFS setup that might be notable, such as if either the client machine you're using or the server is using very old/unusual software/etc.