Closed Bug 453704 Opened 16 years ago Closed 16 years ago

Extreme slowness, "Firefox is already running" error for >3 users launching Firefox in LTSP environment

Product: Firefox :: General
Type: defect
Platform: x86, Linux
Priority: Not set
Severity: critical
Status: RESOLVED DUPLICATE of bug 455829
Reporter: jerickson
Assignee: Unassigned

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Firefox/3.0.1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Firefox/3.0.1

I'll just paste in what seems to be the common problem here amongst 4 of 
my schools, when there is a full class of ~35 students logging on and 
trying to launch Firefox (these are all different people sending me 
e-mails):

-----
Just had a class in here and they went on Firefox, it took about 4 
minutes to get them all on.  First 5 popped up right away then it 
started to slow down.  Had 3 stations that after everyone was in I had 
them click on the browser icon and they go right on.
---
Four or five students were able to start working in typingmaster (within 
Firefox).  For the rest, it told them they were already in Firefox.
---
It took 15 minutes for 20 computers to logon to Firefox.  The first one 
took 10 seconds, the next 20 seconds, the 6th one to try and come up had 
the "Already Open" error.  After 20 minutes I went back into 
Apps/firefox with the remaining workstations and all but two went in.  
Workstations s8, s9 have the error message.
---

Tested Firefox today with a class of 20. About half could get on 1 at 
a time, the other half froze up the desktop and kids couldn't even 
logoff Ubuntu.

-----


After a ton of research for the past 3 days, I've come up with the 
following potential fixes which have sped things up just a bit, but 
definitely do not solve the core "Firefox is already running" (and most 
of the slowness) issues:  http://lns.wikidot.com/firefox3ltspoptimizations

These are all beefy HP Proliant servers, 2xdualcore Xeon CPUs, 8GB RAM, 
so horsepower shouldn't be an issue. Gutsy + FF2 worked beautifully, 
this only started happening at the beginning of this school year, after 
I had upgraded all servers to Hardy+FF3 over Summer break.

Reproducible: Always

Steps to Reproduce:
1. Log 10 different users into separate Ubuntu 8.04.1 LTSP client sessions
2. Launch Firefox 3 in all user sessions at the same time
Actual Results:  
Extreme slowness in launching Firefox, random "Firefox is already running" error message popping up with users, extremely high CPU usage (30-60% usage on 2x dualcore Xeon 1.6GHz CPUs) with all running Firefox processes at idle. Some users will experience "Starting Firefox" on bottom Gnome panel, and then it will disappear (no Firefox gets launched)

Expected Results:  
Firefox should launch normally for all users, with minimal CPU usage, and no "Firefox is already running" errors popping up.

Running 'firefox' from a terminal yields no output.
Just got this additional e-mail from a teacher who has been working closely with me to identify the problem:

"I completely rebooted all computers this morning, had a whole class in and NOBODY could get on Firefox. Then, while they were playing games, 3 suddenly got on Firefox – about 4-5 min. later. Needless to say, not many of my classes are interested in coming in and testing the system anymore. The kids and teachers are becoming frustrated."

Thanks for any help from the Mozilla team.
- Jordan
I have a similar classroom of 32 HP t5135 thin clients connecting to a PowerEdge SC1430 (quad-core Xeon E5310, 1.6GHz, 2x4MB cache, 1066MHz front-side bus) with 8GB of memory and RAID 1 SATA drives.  The server is connected via a 1Gb NIC/switch.

I am not getting the "Firefox is already running" errors.  My observation is that one rogue Firefox instance (probably hitting a Flash site) can bring the whole system to its knees or lock up the server.

This was a fresh install last month of Hardy and the system and ltsp chroot are all up to date.  This is only the 2nd full day of school so I may be able to get more data from the teachers soon.  I put in the firefox3ltspoptimizations today to see if it will make a difference tomorrow.

Munin is running:
http://k2.mtacademy.us:8081/munin/localdomain/localhost.localdomain.html#System

Boot time is 4-5 minutes for each workstation.  I am using autologin with the home directory being cleaned out with each login or reboot of the client.

It would be nice to be able to control the memory and CPU usage of any one instance of a program running in LTSP.
I run a Linux lab with 26 TCs using Ubuntu 8.04 (Hardy) with LTSP 5 (we recently upgraded). We have a genuinely powerful quad-core server with 4GB of RAM, and the TCs are all P4s with 512MB of memory, so there should be no hardware performance problems.
Firefox (3.0b4) runs very slowly. Many machines give me the "Already Open" error. (My workaround is to tell the pupils to log out; I then delete their hidden files, they log in again, and then they can start Firefox.) Others wait from 30 seconds to 3 minutes for Firefox to open.

[Aside from this, in one 30-minute lesson, we had 11 mouse freezes and 15 keyboard freezes, only alleviated by resetting the client.]

I'll keep watching this site for ideas and solutions. My purpose is to let you know that there are many of us out here struggling!!
An update from one of my sites this morning...

---
"My first class today, 5 students got the "Firefox already open" message.
They each waited a few minutes and tried again.  All were in in under 4
minutes."
---

Also, this issue is being discussed in the mailing list "LTSP-Discuss" as it seems to be directly related to LTSP setups. This looks like some helpful troubleshooting information, but I am reluctant to try it until I get some feedback from the Mozilla team. Please let me know if this sounds like a good idea to test, otherwise please advise on something I can do to try and help you guys out in pinpointing this issue.

---
I encountered the same problem at our school with a multiple-server
setup of Ubuntu Hardy and Firefox 3.0.1. No way to keep students working.

We should collect some more information to make sure we do not mix up
different issues:

1.)
Is this only a problem with Firefox 3.0.1, or does Firefox 2 show the
same behavior in Hardy?

2.)
What kind of lock files do you get in the various situations? 
(cd to firefox profile,  ls -l | grep lock)
case
a. firefox starts normally
b. firefox start is delayed
c. firefox terminates with message "Firefox is already running"
d. firefox crashes without message

3.)
Are you working on multiple servers with home directories mounted via
nfs? 

4.)
What happens if students start the browser from a terminal with the
command firefox -safe-mode ?


Explanation to Question # 2.):

I have looked at the Firefox profile and the profile locking mechanism,
which is where I suspect the cause of our problem lies. I found out that
Firefox 3 has a different mechanism of profile locking. In the profile
folder (/home/user/.mozilla/firefox/xyz.default/) a symbolic link is
created by Firefox as it starts:

ls -l shows the link named "lock":

lrwxrwxrwx 1 wollw wollw 15 2008-09-08 20:32 lock -> 127.0.1.1:+6164

The link disappears as Firefox exits.

I am not familiar with the strange "->IP:+Number" syntax, whose parts are
the loopback IP and the process ID of the Firefox process. Can someone
explain what kind of link this is? Appending text to the file and
reading from it shows that it is a kind of named pipe.

If I remember correctly, former versions of Firefox had just an ordinary
file named "lock". As a consequence of this change, a second Firefox
process now just opens a new window and becomes part of the existing
process, while in former versions the second process terminated with the
error message "Firefox already running".

Please have a look at the lock file. In another situation, even on the
local machine, the IP of the interface eth0 is given instead of the
loopback IP:

lrwxrwxrwx 1 wollw wollw 19 2008-09-08 21:58 lock -> 192.168.0.20:+21469

I describe this in full detail because we have a multiple-server
environment with three secondary servers and one primary LTSP server
which holds the home directories (and therefore the users' Firefox
profiles). The home directory is mounted via NFS.

There are Bugzilla reports about earlier versions of Firefox with
profiles on NFS-mounted locations.

Kai Wollweber
IGS Eckernförde
---

(Not sure why this reply wasn't in the list archives, but most of the thread can be followed here: http://sourceforge.net/mailarchive/message.php?msg_id=48C04301.6060309%40logicalnetworking.net )
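
For anyone collecting the lock-file information requested in the quoted message: the target of the "lock" symlink is just a string, so it can be read with readlink and split into its address and PID parts without otherwise touching the profile. The following is only a minimal sketch, assuming a POSIX-style shell with readlink available and the "address:+pid" target format shown in the listings above; lock_host, lock_pid and describe_lock are made-up helper names, not part of Firefox or LTSP.

```sh
#!/bin/sh
# Sketch: decode a Firefox 3 profile "lock" symlink target such as
# "127.0.1.1:+6164" into its address and PID parts.
# All function names here are hypothetical helpers for illustration.

lock_host() {
    # $1: symlink target. Everything before the last colon is the address.
    printf '%s\n' "${1%:*}"
}

lock_pid() {
    # $1: symlink target. Take everything after the last colon and
    # strip an optional leading "+".
    pid=${1##*:}
    printf '%s\n' "${pid#+}"
}

describe_lock() {
    # $1: path to the "lock" entry inside a profile directory.
    if [ -L "$1" ]; then
        target=$(readlink "$1")
        echo "held by PID $(lock_pid "$target") on $(lock_host "$target")"
    else
        echo "no lock symlink at $1"
    fi
}
```

Pointing describe_lock at the "lock" entry of a profile directory (for example the xyz.default placeholder path used above) shows which address and process a lock claims to belong to. On NFS-mounted homes this may help to tell a leftover lock from another server apart from one held by a live local process.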
(In reply to comment #2)
> I am not getting the "Firefox is already running" errors.  My observation is
> that one rogue Firefox instance (probably hitting a Flash site) can bring the
> whole system to its knees or lock up the server.  

This sounds like a completely different bug.

> It would be nice to be able to control the memory and CPU usage of any one
> instance of a program running in LTSP.

Again, this is not an issue regarding the bug you're commenting on. Let's keep the information relevant, please!
This *could* be a duplicate of the following bugs regarding fsync:

Bug 442967 -  Reduce fsyncs and writes in Places
https://bugzilla.mozilla.org/show_bug.cgi?id=442967

Bug 421482 -  Firefox 3 uses fsync excessively
https://bugzilla.mozilla.org/show_bug.cgi?id=421482

It would make sense, given LTSP is a multi-user environment, all utilizing a single computer, with a single ext3 filesystem, which would definitely be a bottleneck when it comes to excessive fsync'ing. I would imagine that fsync would be even harder on the server while CREATING the firefox profile/places.sqlite, which is precisely where most of the issues are cropping up (new users launching Firefox taking 5-10min to launch, little/no error messages, "Firefox is already running" possibly after they try to launch again).

Any input greatly appreciated.
(In reply to comment #6)

Quite likely.

You can test whether this is the cause (and work around it if so) by setting toolkit.storage.synchronous to 0, but you'll need your own backup strategy if profiles are important.  I don't know whether there's an official description of this pref, but there's a summary here:

http://www.linuxtoday.com/news_story.php3?ltsn=2008-06-05-011-26-NW-DT-RL-0002
Ok - here's the scoop after being on-site and testing out a full lab (35 thin-clients) launching Firefox, with no current ~/.mozilla profile (I nuked them all before testing), with and withOUT pref("toolkit.storage.synchronous", 0); set in /etc/firefox/pref/firefox.js (which will affect ALL users):

----------------------------------------------------------
'toolkit.storage.synchronous' pref NOT set: Running Firefox on the local LTSP server with no profile, using the following command:

strace -f -e fsync firefox 2>&1 | fgrep fsync | wc -l

Rendered 99 counts of fsync.
---
'toolkit.storage.synchronous' pref SET: Running Firefox on the local LTSP server with no profile, using the following command:

strace -f -e fsync firefox 2>&1 | fgrep fsync | wc -l

Rendered 28 counts of fsync.
---
'toolkit.storage.synchronous' pref NOT set: Running Firefox on the local LTSP server with the existing profile (from above), using the following command:

strace -f -e fsync firefox 2>&1 | fgrep fsync | wc -l

And then opening up tabs for slashdot.org, yahoo.com and google.com:

Rendered 48 counts of fsync.
---
'toolkit.storage.synchronous' pref SET: Running Firefox on the local LTSP server with the existing profile (from above), using the following command:

strace -f -e fsync firefox 2>&1 | fgrep fsync | wc -l

And then opening up tabs for slashdot.org, yahoo.com and google.com:

Rendered 24 counts of fsync.
----------------------------------------------------------
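
The four measurements above all use the same pipeline. As a sketch of how the counting could be repeated from saved logs (for example, one log per thin client), assuming the trace was written to a file with strace's -o option; count_fsyncs is a hypothetical helper name. It matches "fsync(" rather than "fsync" so that the "<... fsync resumed>" continuation lines, which strace -f can print for interleaved calls, are not counted a second time.

```sh
#!/bin/sh
# Sketch: count fsync calls in a saved strace log.
# The log is assumed to come from something like:
#   strace -f -e fsync -o firefox.trace firefox
# (firefox.trace is an arbitrary file name chosen for this example).

count_fsyncs() {
    # $1: path to the strace log.
    # grep -c prints the number of matching lines; each fsync call
    # contributes exactly one line containing "fsync(".
    grep -c 'fsync(' "$1"
}
```

Running count_fsyncs on logs taken with and without the pref set gives numbers that can be compared in the same way as the counts listed above.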

OK. So now we have a rough baseline of how the toolkit pref affects fsync'ing. What I did next was simulate a full class of students coming in, logging into Gnome and, all at the same time, launching Firefox.

Instances of launching a full lab of thin-client firefox sessions, with NO profile, were pretty different when 'toolkit.storage.synchronous' was set as opposed to not. Without it set, we waited for about 15 minutes, and only about 1/2 of the thin-clients had launched Firefox. WITH it set, again, with new profiles (I nuke them every time), it took about 5 minutes for all systems to launch Firefox. Better, but still really really bad.

I used the following command on 3 different thin-clients to gauge where Firefox was hanging. I saw a commonality across all of them: it would hang at this point in the strace:

---
stat64("/usr/lib/xulrunner-1.9.0.1/components/nsWebHandlerApp.js", {st_mode=S_IFREG|0644, st_size=6920, ...}) = 0
open("/dev/random", O_RDONLY)           = 15
read(15,
---

RIGHT at the "read(15, " it would hang for 3-8 minutes before it proceeded through and loaded Firefox. The "15" was sometimes "19" or "23" instead; not sure if this matters. Sometimes it would proceed through and load the UI; sometimes it would process a bunch more strace output and then hang on another "read(19, " line. AND it wasn't always right after the "nsWebHandlerApp.js" line; a couple of other times it was here:

---
open("/dev/random", O_RDONLY)           = 19
read(19,
---

And it would hang.


AFTER launching the UI (finally), I saw (which might be completely normal) a continuous cycle of the following in strace:

---
gettimeofday({1221521746, 829628}, NULL) = 0
read(3, 0x8075324, 4096)                = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN}], 7, 0) = 0
---
It would hang at "poll(" and cycle through these messages.



You can find the 3 different workstations' strace logs, with and without the toolkit pref enabled, here: http://logicalnetworking.net/other/ac/



I really hope this information helps. After seeing this happen with my own two eyes, I have to say...wow.
As per (currently in-progress) conversation with mzz on irc.mozilla.org/#firefox :

---
mzz aha!
mzz you're running out of random, that makes lots of sense
Lns mzz: also see conversation in #places w/sdwilsh re: this issue: http://pastebin.com/m5ea839cc
mzz it needing tons of randomness makes less sense, but running out of it definitely does
Lns mzz: really?
mzz so my fsync guess wasn't exactly spot on
Lns !!!
Lns =)
mzz Lns: reading from /dev/random blocks when the kernel runs out of entropy. Entropy is normally gathered from external sources, like network packet timing and user input events.
mzz Lns: your server has no user input events and probably not that much network traffic going on, but it does have tons of firefox instances (apparently) reading /dev/random
* mzz reads on
Lns mzz: well there is plenty of network traffic as all X11 sessions are over network
Lns mzz: but definitely not local console input
mzz Lns: blocking on poll is perfectly normal (means it's reached the mainloop and is waiting for events)
mzz now let's see why it's using /dev/random
mzz mmm
mzz Lns: what'd help (if it's not using this data for anything absolutely critical) is if it was using /dev/urandom instead of /dev/random
mzz Lns: see http://en.wikipedia.org/wiki/Urandom
* Lns clickie-clickie
Lns mzz: well jeez, for something like firefox profiles, why NOT use urandom?? theres nothing security-critical going on.
Lns mzz: is that easy to fix without a new build?
mzz unsure
mzz let me try to figure out what it's gathering randomness for
---
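
mzz's explanation suggests a simple check that can be run on the LTSP server while a class is logging in: watch the kernel's entropy estimate and see whether it drops towards zero as the Firefox instances start. This is a rough sketch, assuming a Linux kernel that exposes /proc/sys/kernel/random/entropy_avail; entropy_avail and watch_entropy are hypothetical helper names, and the sample count and interval are arbitrary.

```sh
#!/bin/sh
# Sketch: sample the kernel's available-entropy estimate over time.

entropy_avail() {
    # Prints the kernel's entropy estimate in bits, or "unknown" if the
    # proc file cannot be read (non-Linux system, restricted container).
    cat /proc/sys/kernel/random/entropy_avail 2>/dev/null || echo unknown
}

watch_entropy() {
    # $1: number of samples to take, $2: seconds to wait between samples.
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "$(date +%T) entropy_avail=$(entropy_avail)"
        i=$((i + 1))
        if [ "$i" -lt "$1" ]; then
            sleep "$2"
        fi
    done
}
```

Something like "watch_entropy 60 5" left running in a terminal during a lab session would show whether the stalls on read() from /dev/random line up with a low entropy value. A consistently low number while clients are hanging would support the diagnosis in the chat above.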
So this is bug 455829?
Keith at GSPS - I tried to follow what the experts have analysed above, but there is so much technical stuff. Is there a solution to the Firefox problem?

This morning I allowed my Grade 6 class to do their e-mail (web-based) - they all opened firefox 3 - 4 and all hell broke loose!! Long, long waiting times of up to 15 minutes, messages saying "Firefox is already running", and freezing up of others on the LAN who were using other programs. (We are on Ubuntu 8.04 LTSP5. All thin clients with no HDDs.)

So my question is: Have we reached a conclusion? IS there a solution? In simple terms?
Keith at GSPS
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → DUPLICATE
This does look like a dupe of 455829 - although this bug outlines a bit better the criticalness of the issue, especially in multi-user environments such as web kiosks / LTSP setups. It looks like there's already a patch in the other bug which will fix the issue (using /dev/urandom instead of /dev/random).
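
To illustrate the difference the fix relies on (this is not the patch from bug 455829, just a shell-level sketch): a read from /dev/urandom returns immediately instead of waiting for the kernel to gather more entropy, which is what the blocked read() calls on /dev/random in the strace output above were doing. random_hex is a hypothetical helper name; head -c is assumed to be available, as it is on GNU and BSD systems.

```sh
#!/bin/sh
# Sketch: read random bytes from the non-blocking /dev/urandom device.

random_hex() {
    # $1: number of random bytes to read; prints them as lowercase hex.
    # od prints the bytes as space-separated hex pairs; tr removes the
    # spaces and line breaks so the result is a single hex string.
    head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
}
```

Calling "random_hex 16" should return at once even on a server whose entropy pool is drained by many simultaneous launches.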