Pid-signature on profile lock not unique (was: No option to ignore profile lock), results in "profile is in use" message after crash

RESOLVED FIXED in mozilla1.8alpha3

Status

P1
critical
RESOLVED FIXED
17 years ago
3 years ago

People

(Reporter: BenB, Assigned: roc)

Tracking

Trunk
mozilla1.8alpha3
x86
FreeBSD
Dependency tree / graph
Bug Flags:
blocking1.8b3 -
blocking-aviary1.5 +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [cb] no progress for 1.8b3?)

Attachments

(2 attachments, 3 obsolete attachments)

(Reporter)

Description

17 years ago
Reproduction:
1. Make it so that the profile is locked, but Mozilla is not running. That
sometimes happens during crashes (e.g. of the machine).
2. Start Mozilla, selecting that locked profile

Actual result:
Mozilla tells you that the profile is locked, because another instance of
Mozilla is using it. That is untrue (there is no other instance). There is no
way in the UI to ignore the lock and start Mozilla with that profile
nevertheless. Without going to the filesystem, you cannot use the profile
anymore at all - it is lost.

Expected result:
1. Mozilla checks, if there is actually another Mozilla instance running.
- If no other instance found
  2. guess that the lock is a leftover from an abort.
  3. Tell the user so, warning about potential profile corruption and warning
     that there might indeed be another Mozilla instance that we just didn't
     find, e.g. a Mozilla on another computer, which uses the same profile
     on a network-shared folder.
  4. Allow to stop/proceed, default proceed.
  5. Remove lock, if proceed.
- If we another Mozilla instance found running
  2. Pop up the dialog we have today
  3. The dialog offers a way to ignore the lock, on the user's risk.
  4. Default is to honor the lock and not to proceed.

I see that this is a bit of work. The most important part are the lalst 2 steps
- allowing a way to override the lock in the current dialog.

Workaround:
1. Remove the file "lock" in your profile.

Severity:
The should not be hard to implement, and it easy to work around, but normal
users will not find that workaround, which makes their emails and bookmarks
practically (almost) unaccessible, which is practically major dataloss for them.
Thus, severity critical, although it's surely arguable. In any case, I think
this should be fixed soon - by far worse than a crash IMO.

Additional Comments:
- Might be cross-platform, dunno.
- I surely appreciate that we seem to allow 2 instances of Mozilla with
different profiles at the same time.
AFAIK after a comment from ccarlen's we will not add an ignore.
(not as pref, hidden pref or something else)

You can file a bug that mozilla should detect that the profile is not in use.
(and not critical, since there is no dataloss, memory leak or crash)
Severity: critical → major

Comment 2

17 years ago
> Mozilla tells you that the profile is locked, because another instance of
> Mozilla is using it. That is untrue (there is no other instance).

this part is a bug, but I have never seen it (except for bug 145061).  Mozilla
checks the pid the lock file claims to have and if it does not exist, ignores /
removes the lock file.

What you're seeing could easily happen over NFS.

see also bug 122698
(Reporter)

Comment 3

17 years ago
I saw this at least twice since my machine locks up hard occasionally recently.
I use Mozilla 1.0-code.
No NFS or network shares involved. There definitely was no other Mozilla
instance, because the machine just booted.

> (and not critical, since there is no dataloss, memory leak or crash)

As I outlined above, this bug equals complete dataloss of the whole profile for
normal users without external help.
Severity: major → critical
(Reporter)

Comment 4

17 years ago
->backend
(As we speak, it happened again.)
Assignee: ben → ccarlen
Component: Profile Manager FrontEnd → Profile Manager BackEnd
Sounds like a problem in the fix for bug 76431. We do check that the process
that acquired the lock is alive so crashing while holding the lock should not be
an issue - except perhaps if some other process gets the same process id next
time around. There are also signal handlers installed which should remove the
lock in the event of a crash. I need to check and see whether pid rollover could
be to blame.

Anyway, I don't think there should be an option to ignore the lock. The locking
code needs to avoid getting into this situation and be absolutely fail-safe. 
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.1beta
Brendan, check out comment #5. Does process id rollover seem likely? To avoid
this on the Mac, I record in the lock file both the process id *and* the time at
which the process was launched in order to prevent this.

Comment 7

17 years ago
Ben, when this problem occurs, could you check the lock file?  Is it pointing to
a PID that exists?
I do not believe PID rollover can happen on modern Unixes.

Someone needs to dig a little more here, diagnose the problem, and not call for
a bogus solution (a way to ignore the lock).

/be
(Reporter)

Comment 9

17 years ago
> could you check the lock file?  Is it pointing to a PID that exists?

Yes, it was in at least 2 instances.

> I do not believe PID rollover can happen on modern Unixes.

It is even likely, if you consider the case in question (machine crash, reboot).
PIDs are assigned sequentially, so it is very likely that processes started
during the boot get the same PID as the last time. Extreme case: init is always
PID 1. Similarily, if the user always starts Mozilla directly after the login.

BTW: I notice that the lock is not a file, but a symlink to "<ip>:<pid>". Not
sure, if that's legal, but it's unusual - usually, PID files have the PID as the
(only) content (see /var/run/).

Comment 10

17 years ago
Win98, Mozilla 1.1a: 
Mozilla crashed several times when trying to get Enigmail working (usually IPC 
console freeze/crash) which eventually ended by killing the process (winoldap), 
killing Mozilla simultaneously. After several of these types of crashes, I saw 
this bug. When launching Mozilla, the profile manager comes up asking me to 
select a profile (w/ only one profile to select from) and when I click "Start 
Mozilla", I get the following Alert:

"Mozilla cannot use the profile "earthsound" because it is in use. 

Please choose another profile or create a new one."

I have no other instance of Mozilla running. I will see if rebooting the box 
will do any good and get back to this bug...

Comment 11

17 years ago
After a reboot, Mozilla launched incident free. 

Comment 12

17 years ago
i'm nearly certain that i had pid rollover in FreeBSD 4.4 while i was at 
worldgate. otoh my box was up for nearly my entire term there (no reboots, no 
panics, no kernel updates, no power failures).

QNX seemed to use 10digit numbers for pids and whether it actually could roll 
over would probably not matter since it didn't seem likely that randomness 
would give you a reasonable chance to hit the same pid.

ignoring locks would be stupid, but having a capture/replace lock option might 
be reasonable.

Comment 13

17 years ago
*** Bug 145363 has been marked as a duplicate of this bug. ***
*** Bug 157223 has been marked as a duplicate of this bug. ***
> I do not believe PID rollover can happen on modern Unixes.

As a note, just building mozilla a few times will lead to PID rollover on a
typical RedHat Linux install running on the Intel PC arch.... (yay all those
shell  and make processes).

Comment 16

16 years ago
Are multiple instances of Mozilla on DIFFERENT machines sharing a single home
directory via NFS simply not allowed?

If so, the locking design needs to be seriously re-examined..

Comment 17

16 years ago
> Are multiple instances of Mozilla on DIFFERENT machines sharing a single home 
> directory via NFS simply not allowed?

Absolutely.  The bug I think you're looking for is bug 135137
*** Bug 163854 has been marked as a duplicate of this bug. ***
I still don't think any UI should let users override the lock.  If we had to do
that, we'd be in a situation where non-experts would choose "override" out of
ignorance.  Experts can rm.  This bug should be about qualifying the pid-sig in
some backward compatible way.

Linux kernel helpers (shaver!), is there no way to get the guaranteed-unique, 32
bit pid from the system?

/be
Summary: No option to ignore profile lock → Pid-signature on profile lock not unique (was: No option to ignore profile lock)
> This bug should be about qualifying the pid-sig in some backward compatible way.

I agree completely. Can we just encode the launch time of the process which
acquired the lock into the string in addition to the PID? That way, we can parse
for two tokens, and if we find only one, it's an older app without this change
and we just have to believe the lock is valid. All we need is, given a PID, find
out the time at which that process was launched. Anybody know how to do this?
(Reporter)

Comment 21

16 years ago
ccarlen, I am not sure what you are talking about. This bug is about the fact
that my system locked up and Mozilla left a lock in the profile. After reboot, I
tried to start Mozilla again and I couldn't, because the lock prevented it.

So, this has absolutely nothing to do with PID rollover (which means - I infer
from your discussion - the same PID for 2 different processes *during the same
OS session*). PID rollover could just be another, very unlikely, case where the
same result appears.

I don't think that my case is an "expert" situation. Computers can crash hard
for newbies as well. Consequently, we should not rely on expert knowledge to
recover from that situation.

Comment 22

16 years ago
> I am not sure what you are talking about

> encode the launch time of the process

in addition to recording the PID of the mozilla process, record the start time.
 this would resolve PID rolover AND PID re-use after reboot.

> Anybody know how to do this?

I think this would be very platform-specific.  On Linux, you could look in
/proc/(PID)
> So, this has absolutely nothing to do with PID rollover

I think it must. The only way that this could happen is that mozilla left a lock
file, the system crashed and was rebooted and, next time around when mozilla
checked the PID of the lock file, some other process now had that PID. Knowing
the time at which the process with that ID was launched would allow us to detect
this.
(Reporter)

Comment 24

16 years ago
OK, and what do you do in that case? You see "There's a lock, and it's not from
me." What then?
If the launch time enocoded in the lock does not match the launch time of the
now running process with the PID which is encoded in the lock, you know the lock
is invalid and claim it.
*** Bug 156836 has been marked as a duplicate of this bug. ***
*** Bug 156293 has been marked as a duplicate of this bug. ***
*** Bug 183903 has been marked as a duplicate of this bug. ***
*** Bug 189891 has been marked as a duplicate of this bug. ***
Setting to milestone that's not passed.
Target Milestone: mozilla1.1beta → mozilla1.4beta

Comment 31

16 years ago
Setting All/All per comment 26.
OS: Linux → All
Hardware: PC → All
> Setting All/All per comment 26.

Yeah, but now that the change mentioned in that comment has been made, it's no
longer All/All.
OS: All → Linux
Hardware: All → PC

Comment 33

16 years ago
Mozilla 1.3b, Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3b) 
Gecko/20030210, Windows 2000

I grabbed a fresh install of Mozilla 1.3b and ran the GRE installer while 
Mozilla was running in it's first instance.

The GRE installer prompted me to shut down the running instance of mozilla, 
which I agreed to, and the GRE installer proceeded to shut down the exisiting 
instance after notifying me about Quicklaunch. 

After GRE installation completed, I relaunched Mozilla and was greeted by the 
profile manager. The Default profile was selected and clicked Start Mozilla. 
Instead of Mozilla, I was surprised by an alert which read:

"Mozilla cannot use the profile "default" because it is in use.

Please choose another profile or create a new one."

Mozilla's Quicklaunch was not running when I tried the 2nd launch and a Mozilla 
wasn't visible within the Windows environment, nor was Mozilla listed in Task 
Manager>Applications. I could only start Mozilla after I killed the mozilla.exe 
process still listed in Task Manager>Processes.

I will uninstall and see if I can repeat this. Would any patch applied toward 
this bug touch upon this manifestation of this bug, or should I file a new one?
(Reporter)

Comment 34

16 years ago
> I could only start Mozilla after I killed the mozilla.exe 
> process still listed in Task Manager>Processes.

earthsound, what you saw is a different bug. In your case, there actually was a
Mozilla still running. In my case, there were none, but the lock was
nevertheless there.

Comment 35

16 years ago
*** Bug 193136 has been marked as a duplicate of this bug. ***
*** Bug 198255 has been marked as a duplicate of this bug. ***
*** Bug 203733 has been marked as a duplicate of this bug. ***

Comment 38

16 years ago
My bug (bug #203733) got closed as a dup of this one.  I would encourage
the folks working on this bug to look at the comments I put there.  I'm
not sure that it is truly a duplicate, as the comments in this bug report
seem to state that this is about fixing PID-related problems.

To summarize bug #203733: the fact that the IP address is recorded
seems to cause another variant of the PID-wrapping/PID-replay problem
documented here.  This is my reading of the code at
http://lxr.mozilla.org/mozilla/ident?i=USE_SYMLINK_LOCKING:

2099                         if (addr != inaddr.s_addr)
2100                         {
2101                             // Remote lock: give up even if stuck.
2102                             break;
2103                         }

Timestamps don't solve the networking problem I've outlined.  Users
with profiles on NFS servers are a fact of life, and users migrating
between client nodes is a part of that.  Even if you claim "no we don't
support that", DHCP is *also* a fact of life, and the existing scheme
doesn't seem to work on a host whose IP address changes frequently, as
is the case with many laptop users.

When the IP addresses conflict, I don't see how you can reasonably
fix this without a UI tweak.  Perhaps you could only activate the UI
option if the code detects that the local and lock IP addresses are
different; this would alleviate some of the non-expert vs. expert fears.
More generally, I disagree with Brendan's comments about experts; case
in point:  I got a call from my manager, who is not an "expert", but
who ran into this problem.  It's not her fault.  This bug then caused
me to spend time on figuring out and fixing the problem.  How is that
a good user experience for either her or me?  No one likes having to
call up the local nerd to fix his/her browser.

Thanks for working on this, I'm glad to see it being addressed.

Comment 39

16 years ago
1. ping - a proposal to handle ip address changes in a network environment.
when we examine the lock file and determine that the ip address doesn't match
the local computer, try pinging the referenced computer.

if it's down, we ass-u-me that the lock is bogus.
this could fail for a worldwide nfs server where someone uses a profile in one
place, and then moves to another corporation which blocks icmp. but i don't
think that would be very likely.

2. readonly access - an offtopic proposal to handle annoyed users
(this is yet another reason why it might be useful to support readonly access to
profiles.)
Instead of telling users that they can't use a profile when we can't get a lock,
tell the user that their changes will be lost, but let the user use the profile.
i believe n4 did this.

Comment 40

16 years ago
*** Bug 203832 has been marked as a duplicate of this bug. ***

Comment 41

16 years ago
timeless wrote:
> 1. ping - a proposal to handle ip address changes in a network environment.
> when we examine the lock file and determine that the ip address doesn't match
> the local computer, try pinging the referenced computer.

Erm... what will happen in the case of DHCP ? Just guess, please... =:-)

Comment 42

16 years ago
I'm sorry, I don't think the ping proposal is really viable; it certainly
doesn't solve the university case, where a student is moving about from
host to host in a computing cluster.  In particular: computer A crashes
while student is running mozilla.  So she moves to computer B and then
gets this message, since computer A still "holds" the lock.

At the very least, you could provide the user the hint:

This profile is locked by a copy of mozilla running on the remote
computer "foobar" (PID 12345).  If you are SURE that it is not in
use, you may need to remove the file
"/home/<user>/.mozilla/5t1x4ov6.slt/lock" and restart mozilla.


Comment 43

15 years ago
*** Bug 215257 has been marked as a duplicate of this bug. ***

Comment 44

15 years ago
some thoughts:

1. Mozilla should still try to tell whether a lock is bogus or not even
knowing in some cases it couldn't at all. Once a lock is determined to
be bogus, it should be cleared without know of the user. ( In local host
case we could tell even a pid wrap happens)

2. Since mozilla never know whether a lock is bogus or not in the cases
of nfs mounted home directory and such case is very common in enterprise
environment, means should be provided to allow the user to delete the
lock if he is sure it is safe. If we don't do this, such users get very
bad user experience.

3. About what means provided to user. a message saying something like
" An instance of mozilla running on "Hostname", PID=12345, is using this
profile, if you are very sure it's not in use any more, you need to
remove the file /home/..../.mozilla/xxxxxx/lock" and restart mozilla".
Providing an button which directly clear the lock file may be dangerous,
allowing for many users is likely to press a button before reading
carefully the alarms and the severe results of running more than one
mozillla on same profile.

4. Is there a resolvement to know whether a lock is alive or not in
network environment? how about heart beat timestamp? although it has its own
problems too


*** Bug 225157 has been marked as a duplicate of this bug. ***

Comment 46

15 years ago
I would have really liked Dan's "Hint" message (comment 42) when I suffered this
problem yesterday (Linux hard crash, machine rebooted). Although the lock
symlink was broken, and therefore drawing a lot of attention in my colorful "ls
-al" command, I felt quite nervous about removing this file with no easily-found
documentation.

IMHO, Dan's suggested text has two problems:
  (1) His message text states that it is a remote computer; this assumption may
not be valid. But, if DHCP has assigned a new address to the local machine in my
scenario, we can't use the address to confirm that it is local, either. Just
list the machine addr, don't pin it down either way.
  (2) The phrase "you may need to" is too coy: If you are SURE that the profile
isn't in use, and that it is the one you want to use, then "you should" remove
the file.

If the Severity is "critical", I feel that we should at least provide
documentation of the work-around in the pop-up. Struggling to create code which
handles duplicate PIDs and DHCP addresses seems much less important to me. The
bug has been open for 15 months!

Please advise if a separate bug is approriate, although I'm not very experienced
with creating these things.

BTW, if this bug summary contained the words "in use" rather than "lock", I
would have found it. The current Profile Manager error message *DOESN'T CONTAIN*
the word "lock".

Comment 47

15 years ago
*** Bug 226598 has been marked as a duplicate of this bug. ***

Comment 48

15 years ago
*** Bug 228064 has been marked as a duplicate of this bug. ***

Comment 49

15 years ago
[ I am submitter of bug #228064 in which I described a desired behavior close to
comment #44 ]

Just adding a comment to insist on the fact that this issue is *critical* for a
beginner.

I posted this issue after someone sent me a mail saying :

"I'm writting to you because I have a BIG problem with Mozilla.

Yesterday, opening it, it asked me something like chose/configure the account to
use. [...] I don't know what I've done, but now, I can't use my "account" (my
old mails, addressbook, and even my bookmarks) anymore"


So, please

* Add a message explaining the workaround in the *next* version of Mozilla (This
is trivial, so, should be done *right now*).

* Then, it becomes less critical, but it would still be really good to offer an
option to delete the file. (keep in mind that ~/.mozilla is hidden, so,
unreachable for a newbie :-( ... )

Comment 50

15 years ago
I got a new idea, but not sure my assumptions are right.

I think heart beat is the only way to solve this problem. 

1.A mozilla instance must periodly(say every 5 minutes) claims that it owns a
lock by touching lock file( and write some stuff in it if we need). 

2. if a lock has not been claimed for 5minutes(read it and then compare its last
access time and last change time), the lock is thought to be bogus.

assumptions:

1. on a net fs environment, the access/visit time of file is  always from the
net fs server.
2....

Comment 51

15 years ago
I just tried to move my mozilla profile between an old machine and my new one,
and got bitten by this. I simply copied /home/rjn/.mozilla from one to the
other, and for some reason, there was still a stale lock file in place.

I agree with #42 - the profile manager should give more information in the error
message, so that the user at least knows why there is an "erroneous error".

Comment 52

15 years ago
*** Bug 234199 has been marked as a duplicate of this bug. ***

Comment 53

15 years ago
*** Bug 235216 has been marked as a duplicate of this bug. ***

Comment 54

15 years ago
*** Bug 239615 has been marked as a duplicate of this bug. ***

Comment 55

15 years ago
The summary here is strange: "Pid-signature on profile lock not unique",
The problem described in this bug happens with unique locks as well.

I had a system crash yesterday with a 1.6 build running, and after reboot a
stale lock was left around. Mozilla did not want to start while the lock existed.
The PID number indicated by the lock was way higher than the PID would have been
had i managed to start Mozilla anew. In the previous session, where it crashed 
and left the stale lock, Mozilla and several other processes had been started
and terminated for weeks on end. The PID of the lock was 27000-something.

Comment 56

15 years ago
(In reply to comment #55)
> I had a system crash yesterday with a 1.6 build running, and after reboot a
> stale lock was left around. Mozilla did not want to start while the lock existed.
> The PID number indicated by the lock was way higher than the PID would have been
> had i managed to start Mozilla anew. In the previous session, where it crashed 
> and left the stale lock, Mozilla and several other processes had been started
> and terminated for weeks on end. The PID of the lock was 27000-something.

what confuses Mozilla is if the PID of the lock file corresponds to an existing
process (any existing process).  If you had no existing process using the lock's
PID, then it's a different (and more serious) bug.
(Reporter)

Comment 57

15 years ago
Andrew, rkaa said her PID file was for a PID 27000-something, and she had the
problem after reboot. I don't think Linux assigns a PID of 27000 right after
booting.

rkaa and the rest: See comment 9 and 21. My problem was (IIRC) that you can have
a stale, invalid lock pointing to the PID of your own and only Mozilla process,
and it's even a likely case.

Start computer
Start Mozilla right after boot (automatically or manually)
Machine crashes hard
Reboot
Start Mozilla right after boot (automatically or manually)

In that case, it's likely that you run into this bug. Mozilla will likely leave
a stale lock behind during the crash, and the Mozilla instance after reboot is
likely to get the exactly same PID as during the last boot. No need for "PID
rollover" or anything freaky or otherwise statistically unlikely.


And, this bug means massive dataloss for the average user. He/she can't get to
the profile, including stored emails, *at all* anymore. Mozilla will refuse to
use it, even after re-installing Mozilla. Only after searching the net on user
help sites will he/she ideally find the hint what to do. So, if the average user
who runs into this problem either says goodbye to *all* Mozilla data or is up
for major trouble and time-sinks.

And it would be so easy to fix...
(Reporter)

Comment 58

15 years ago
Please disregard my last comment. While it's a valid problem on its own, I must
have run into other problems as well. Sorry, too late here.

Comment 59

15 years ago
I was going try implementing fcntl file locking (the "correct" way to do profile
locking), but the code is already there.  It's just avoided in deference to old
NFS servers, or those not running lockd.  The "old" NFS servers would be those
from 2.0 kernels (user-space NFS daemon).  Early 2.2 kernels were apparently
less than optimal as well.

Has enough time passed that we can depend on people:
1) running on their local machine
2) running with newer NFS (we're two major kernel versions beyond "old" kernels)

Is it possible to detect such old nfs implementations and just use symlinks then?
No longer blocks: 189595

Comment 60

14 years ago
Please refer bug 253950 regarding a proposed change in the UI behaviour that
would slightly alleviate the impact of this issue in some cases.  I submitted
that bug for Firefox but Mozilla's behaviour is analogous.

Comment 61

14 years ago
*** Bug 246077 has been marked as a duplicate of this bug. ***

Comment 62

14 years ago
*** Bug 208162 has been marked as a duplicate of this bug. ***
*** Bug 261106 has been marked as a duplicate of this bug. ***
*** Bug 264661 has been marked as a duplicate of this bug. ***
Tweaking summary to hopefully catch some dupes. Sorry for spam.
Summary: Pid-signature on profile lock not unique (was: No option to ignore profile lock) → Pid-signature on profile lock not unique (was: No option to ignore profile lock), results in "profile is in use" message after crash
*** Bug 257306 has been marked as a duplicate of this bug. ***

Comment 67

14 years ago
*** Bug 264793 has been marked as a duplicate of this bug. ***
it seems to me like the real problem was fixed in bug 265313, and was not caused
by a reused pid.

Comment 69

14 years ago
I don't think so.  There's no indication that locking is broken per se.  FWIW,
the lock is subsequently (a day or two) released.
(Reporter)

Comment 70

14 years ago
No, biesi, I don't think so. This bug was observed in a case where

*******
the machine crashes (and so the lock did not get removed) and after a reboot of
the machine, the new Mozilla instance got exactly the same PID as before reboot
- several times.
*******

So, the PID in the lock file will point to a valid, running Mozilla (in fact the
checking Mozilla itself in this case, but not necessarily), but it's still not
the same process that created the lock file, the latter process crashed before
reboot.

Here's how it can happen (I already mentioned it in comment 9, but I'll try with
more words): Look at |ps ax|. Note that the processes (sorted by PIDs) appear in
the order they were started. The processes started automatically from initd
during boot will come first. If you write down their PIDs, reboot, do the same
again, you may find that they have exactly the same PIDs again, although you
rebooted in the meantime, so they are definitely different processes. Whether
the PIDs do happen to match exactly, may depend on your OS (kernel; how boot
scrips are implemented, e.g. if it runs a |rm| separately over each of the files
in /tmp/, whose number will vary). But *if* PIDs are assigned sequentially
without randomness (which I think is the case), and *if* exactly the same
processes get started in the same order after reboot (which is also not unlikely
due to initd and co), it's likely that your syslog, apache and kdm and maybe
even the GUI apps started by KDE session manager or another method all have the
same PIDs they had after your last boot. At this point you already have a
problem, if you have an old lock file lying around.

Now add to the equation a machine that has hardware problems, thus crashes hard
very often. This makes the problem more likely to appear. Mozilla will not have
had a chance to remove the lock file (nothing happened anymore, not even signal
handlers or the kernel), and you perform the same boot sequence a lot of times
(because of a lot of crashes in a row), *and* due to the kernel crashing with
the machine, you as a user *assume* that there is dataloss, so it's not all that
suspicious when Mozilla tells you that you can't use your profile anymore, so
you (as a by now terribly unnerved and unpatient end user) are more likely to
write it off and start a new one, although your profile might have been just fine.
This was my case (to my knowledge), so I am not constructing this. Granted, the
machine crash is a hazardous environment to live in, but this is an unnecessary
problem that Mozilla adds, and it can happen in other cases, too.

I'm sure I'm not the only one to start Mozilla immediately after boot. In this
case, the current *idea* of profile locking doesn't work as expected and needs
to be adapted.

Proposal:
If the PID in the lock file exists
- check that it's a Mozilla (don't check for process being called "mozilla",
because it may be called "netscape" or "beonex-comm" or whatever, but use other
methods, e.g. try to communicate with it). If it's not a Mozilla, ignore the
lock file (or allow to override it)
- check that it's not yourself (checking process). If it is, ignore the lock
file. (I'm assuming that every Mozilla process has at most one profile in use.)
Maybe there's a better fix, though, e.g.
Alternate Proposal:
- if the PID in the lock is running, try to communicate with that Mozilla and
ask it, which profile directory it's using currently. Only lock the profile, if
the other Mozilla tells you that it's using the same that you are trying to use.
If the communication fails, because it's not a Mozilla (or the Mozilla doesn't
react anymore), ignore the lock. If you get a such request (e.g. from yourself),
answer 'no profile'.

I still think that a user override (with warnings) is probably a good idea,
until the above gets implemented (this bug is now 2.5 years old) and in general
for other bugs, because no code is infallible. Do not assume that the user
searches on Google about how to remove the lock file.

Comment 71

14 years ago
simpler, just make sure the lock is younger than the pid. if the lock is older
than the pid, then it's bogus.
timeless, that's what I was talking about back in comment 20. I don't currently
have a Linux box to pursue that fix, but if somebody does, I still think it's
the way to go.
Created attachment 170712 [details] [diff] [review]
proposed improvement

This adds a timestamp to the lock symlink's content, which can be compared to
boot time on systems with sysinfo(2) or something like it.  Porting help
needed!

/be
Attachment #170712 - Flags: superreview?(bryner)
Attachment #170712 - Flags: review?(ccarlen)
Comment on attachment 170712 [details] [diff] [review]
proposed improvement

As-is, this will fail to compile on Mac, and probably some other systems as
well.  We may want an autoconf test for sysinfo() if it does exist on systems
other than Linux.  On Mac, I believe you'll want to use sysctl() with
CTL_KERN/KERN_BOOTTIME to get the system start time.  Dunno if other systems
have that sysctl (BSD?)... if so, maybe we should have an autoconf test for
that as well.
Attachment #170712 - Flags: superreview?(bryner) → superreview-
Comment on attachment 170712 [details] [diff] [review]
proposed improvement

Ok, I'll work with a Mac buddy (or steal the wife's laptop and install tools)
to get a Mac-tested patch here.  Comments and porting interdiffs still welcome.

/be
Attachment #170712 - Flags: superreview-
Attachment #170712 - Flags: review?(ccarlen)
Created attachment 170723 [details] [diff] [review]
proposed improvement, v2

This ought to work on OS X -- testing welcome (haven't requisitioned the wife's
G4 powerbook yet! ;-).

/be
Assignee: ccarlen → brendan
Attachment #170723 - Flags: superreview?(bryner)
Attachment #170723 - Flags: review?(ccarlen)
Comment on attachment 170723 [details] [diff] [review]
proposed improvement, v2

>--- nsProfileLock.cpp	12 Nov 2004 19:26:37 -0000	1.7
>+++ nsProfileLock.cpp	9 Jan 2005 05:47:15 -0000
>@@ -49,16 +49,17 @@
> #endif
> 
> #ifdef XP_UNIX
> #include <unistd.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <stdlib.h>
>+#include <sys/sysinfo.h>

This will fail to compile on Mac.  I'm not really in agreement with the
assumption that XP_UNIX means that sysinfo will be available.  This
functionality should be in NSPR or at the very least should be an autoconf
test.

>+#if defined(XP_MACOSX)
>+                            int mib[2];
>+                            size_t size;
>+                            struct timeval boottv;
>+
>+                            mib[0] = CTL_KERN;
>+                            mib[1] = KERN_BOOTTIME;
>+                            size = sizeof(boottime);

boottime is not declared anywhere.  Assuming you meant to use 'boottv'.

This could be condensed some...

int mib[2] = { CTL_KERN, KERN_BOOTTIME };
struct timeval boottv;
size_t size = sizeof(boottv);
Attachment #170723 - Flags: superreview?(bryner) → superreview-
Comment on attachment 170723 [details] [diff] [review]
proposed improvement, v2

Have to get to a Mac tomorrow and test things first.  Also need some autoconf
101 help.

/be
Attachment #170723 - Attachment is obsolete: true
Attachment #170723 - Flags: review?(ccarlen)

Updated

14 years ago
Priority: -- → P1
Ok, I'm bogging.  My patch needs autoconf love and I'm swamped with other,
higher priorities.  Anyone willing to take this into the end zone?

/be

Comment 80

14 years ago
is there a reason not to implement comment 59?
Comment 59 wasn't a concrete implementation plan, unless you ignore the final
sentence: "Is it possible to detect such old nfs implementations and just use
symlinks then?"

If you have an answer to that question, lay it on us.

If you just wanna go for it and require working NFS locking, I'd love to see
that work too -- but it would be better to do that in an early alpha.  Otherwise
who is to say that the number of users exposed to broken NFS locking won't be
too high for us to live with?

/be

Comment 82

14 years ago
actually, looking at the source again, the Mac implementation already falls back
to symlinks, specifically because of old NFS.  The fcntl man page says it should
return ENOLCK if it hits an old NFS implementation (consistent with
http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_02/msg00150.html), which
would mean the current use in Mozilla would work properly and fall back to
symlinks.  The comment in the source says:

http://lxr.mozilla.org/seamonkey/source/profile/dirserviceprovider/src/nsProfileLock.cpp#548
// First, try the 4.x-compatible symlink technique, which works with NFS without 
// depending on (broken or missing, too often) lockd.

There seem to be 3 reasons stated there: NS4 compatibility, missing lockd and
broken lockd.  I don't see NS4 compatibility as relevant anymore and missing
lockd should trigger a fallback to symlinks.  The lockd would need to be not
only old, but broken (claiming to support remote locking, but not).  
Ok, how about a patch to use the XP_MAXOSX code everywhere?  We'd still need
compile-time tests for fcntl and its alternatives.

/be

Comment 84

14 years ago
I have this working, but with one problem:
If an newer (fcntl'ing) build of Mozilla is running and an older (symlinking)
build of Mozilla tries to use the same profile, it will succeed in creating the
symlink lock file and start.  As currently implemented, the symlink gets a
different filename than the fcntl filename, but even if they were the same, the
symlinking method just deletes anything that's not a symlink.

I can add code to each method to not clobber the other type of lock, but that
won't prevent older builds from doing it.  I guess it could try to get both
types of lock file, but that seems a bit overboard.
(Reporter)

Comment 85

14 years ago
Just FYI, another, far simpler and more obvious way to run into this bug (I just
did): After hard reboot, I have a

$ ls -la lock
lrwxrwxrwx  1 browser users      16 2005-01-25 15:21 lock -> 192.168.3.1:3522
$ ps ax|grep 3522
3522 ?        S      0:00 kdeinit: kio_file file /mnt/ram/tmp/ksocket-ben/klaunch

(192.168.3.1 is my IP address.) This prevented Mozilla 1.7.x from using the
profile. So, no need for complicated situations like in my comment 70 or PID
rollover to reproduce this.

I don't know what "fcntl file locking" is, so I can't say, if it's important.
Flags: blocking1.8b2?
Flags: blocking-aviary1.1?

Comment 86

14 years ago
Created attachment 174189 [details] [diff] [review]
patch to use fcntl file locking for Unix

Updated

14 years ago
Attachment #174189 - Flags: review?(brendan)

Comment 87

14 years ago
If you enable fcntl based locking (which I think is a good idea), then you also
need to write out the symlink file for backwards compat.  We must do this so
long as both new and old versions of our apps locate the profile in the same way.

Comment 88

14 years ago
right (see comment 84).  But if we still create a symlink file, it will still
get left behind and this bug will still exist.

Is it really likely for people to have two different versions of the same app
installed and try to use them simultaneously?

Comment 89

14 years ago
> right (see comment 84).  But if we still create a symlink file, it will still
> get left behind and this bug will still exist.

Yes.  That is why we need to have better UI when that problem occurs.  We need
to tell the user that Firefox appears to already be running, and then we need to
give them the option to continue anyways (asking them to verify that Firefox is
really not already running before proceeding).  Showing the "select profile"
dialog is just awful UI.


> Is it really likely for people to have two different versions of the same app
> installed and try to use them simultaneously?

Yes.  Suppose someone has Firefox 1.0 installed on their system using their
distro's RPMS, and then suppose they also downloaded a fresh new copy of Firefox
1.1 from Mozilla.org.  Now, suppose for whatever reason Firefox 1.1 is not quite
setup with their system properly, so clicking http:// links in other products
like GAIM causes the system version of Firefox to run.  If they had Firefox 1.1
running at the time, they'd see a Firefox 1.0 window popup, and maybe they
wouldn't think anything of it.  They might use it, and in the process they may
end up trashing their profile.  This scenario doesn't seem like an edge case to
me.  (NOTE: we had this problem when we first started doing symlink locking for
Mozilla 1.0.  Netscape 6 would trash the Mozilla 1.0 profile IIRC.)

Comment 90

14 years ago
one more thing:  if symlink locking is a failover for when fnctl based locking
fails, then we need to solve the symlink problem anyways, so I don't see
supporting symlink locking for backwards compatibility as being much more work.
(Reporter)

Comment 91

14 years ago
> clicking http:// links in other products
> like GAIM causes the system version of Firefox to run

Shouldn't the firefox starter binary find a running instance of Firefox,
regardless of version (but regarding app), so starting Firefox 1.0 should just
open a new window in the running Firefox 1.1?

Comment 92

14 years ago
yes, good call.  i forgot about the fact that X-remote is now properly
integrated into the firefox shell script! ;-)

hmm... that may indeed be a convenient way around the problem i was describing.
 can anyone think of a way that X-remote might be defeated?

Comment 93

14 years ago
It's not the binary that looks for running instances but the startup script.  If
you pass any arguments to the script (like -inspector, --sync or -P) it will not
try to do X-remote.  The startup script ought to handle some of them, but it
doesn't.

Comment 94

14 years ago
and Seamonkey script, which doesn't try to do X-remote at all (such intuitive
behavior is undesired)

Comment 95

14 years ago
(In reply to comment #87)

> If you enable fcntl based locking (which I think is a good idea), then you also
> need to write out the symlink file for backwards compat.

I have a suggestion for how to implement this: create the symlink (and remove it
on exit) but give the file it points to a new name, e.g., fcntl.lock.

Then, if the symlink already exists and points to fcntl.lock you know the other
instance supports fcntl locking and can use it to determine whether the lock is
still valid.  If the symlink points to another file, the other instance is an
old version and you need to revert to the old behaviour.

This way you maintain backwards compatibility but still get the improved
behaviour once all instances have the new code.

  Harry.
Flags: blocking-aviary1.1? → blocking-aviary1.1+

Updated

14 years ago
Flags: blocking1.8b2? → blocking1.8b3+
Target Milestone: mozilla1.4beta → mozilla1.8alpha3

Comment 96

14 years ago
Reassigning to Josh per Deer Park meeting.  Thanks Josh!
Assignee: brendan → joshmoz
Status: ASSIGNED → NEW

Comment 97

13 years ago
seawood, can you take a look at this with us?

Updated

13 years ago
Whiteboard: [cb] no progress for 1.8b3?

Updated

13 years ago
Flags: blocking1.8b3+ → blocking1.8b3-

Comment 98

13 years ago
*** Bug 301529 has been marked as a duplicate of this bug. ***
Darin's backward-compatiblity argument doesn't apply if the RPM distributor is
the one who applies this patch :-).
A variant of comment #95 would work well, I think. If the link is to something
like "192.168.199.3:7777:obsolete" then today's Firefox builds would treat it
exactly the same --- a valid lock, if pid 7777 exists or the IP address is
different. But future Firefox builds could recognize that this was not the real
lock, that the process that created it made the real lock using fcntl, and if
the fcntl lock is acquired then the symlink is a stale lock and can be deleted.

Comment 101

13 years ago
http://lxr.mozilla.org/seamonkey/source/profile/dirserviceprovider/src/nsProfileLock.cpp#301

301         char *after = nsnull;
302         pid_t pid = strtol(colon, &after, 0);
303         if (pid != 0 && *after == '\0')

with "192.168.199.3:7777:obsolete", |*after| will be ':' (the second colon), the
link will be assumed to be bogus and deleted.  Any such flag would need to be
more subtle (perhaps setting the date to 1970).

Updated

13 years ago
Assignee: joshmoz → nobody
Good point.

We could make the obsolete flag be a "+" in front of the PID. That would be
ignored by strtol.
OS: Linux → FreeBSD
Created attachment 190038 [details] [diff] [review]
fix

This patch does that. It works.
Assignee: nobody → roc
Attachment #174189 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #190038 - Flags: superreview?(brendan)
Attachment #190038 - Flags: review?(bryner)
Attachment #190038 - Flags: superreview?(brendan)
Attachment #190038 - Flags: review?(bryner)
Created attachment 190043 [details] [diff] [review]
real fix

Sorry, the previous patch was the wrong patch. This is the right patch.
Attachment #190038 - Attachment is obsolete: true
Attachment #190043 - Flags: superreview?(brendan)
Attachment #190043 - Flags: review?(bryner)
Comment on attachment 190043 [details] [diff] [review]
real fix

As Basil Fawlty would say, BRILLIANT!

Just lose your wacky Java-ish style ;-) and sr+a=me, contingent on bryner's r+.

>-nsresult nsProfileLock::LockWithSymlink(const nsACString& lockFilePath)
>+static PRBool IsSymlinkStaleLock(struct in_addr* aAddr, const char* aFileName,
>+                                 PRBool aHaveFcntlLock) {

Wacky Java-ish style nit-picks: no need for static, and put the { on its own
line in column 1.

>     nsresult                LockWithFcntl(const nsACString& lockFilePath);
>-    nsresult                LockWithSymlink(const nsACString& lockFilePath);
>+    /**
>+     * @param aHaveFcntlLock if true, we've already acquired an fcntl lock so this
>+     * lock is merely an "obsolete" lock to keep out old Firefoxes
>+     */

How about a blank line before this doc-comment?

So the remaining concern for Unixen was the fear, going back to 1995, that some
lockd implementations hang when you try fcntl -- but such gross bugs must have
been fixed by now (right?).  So I'm not concerned about trying fcntl first. 
Randell Jesup, speak up if you are still concerned.

Great work, thanks for cutting the Gordian Knot here.

/be
Attachment #190043 - Flags: superreview?(brendan)
Attachment #190043 - Flags: superreview+
Attachment #190043 - Flags: approval1.8b4+
(In reply to comment #105)
> >-nsresult nsProfileLock::LockWithSymlink(const nsACString& lockFilePath)
> >+static PRBool IsSymlinkStaleLock(struct in_addr* aAddr, const char* aFileName,
> >+                                 PRBool aHaveFcntlLock) {
> 
> Wacky Java-ish style nit-picks: no need for static, and put the { on its own
> line in column 1.

'static' does have an effect. A good compiler would notice that there's exactly
one use of this static function and forcibly inline the whole thing. I think gcc
even does that.

> How about a blank line before this doc-comment?

Sure.
*** Bug 301941 has been marked as a duplicate of this bug. ***
Comment on attachment 190043 [details] [diff] [review]
real fix

Looks good.  Re: the comment about checking whether the running process is
really the right app, it would be great if we could somehow get the window id
for the process and then query it with xremote.  But that's for another day.
Attachment #190043 - Flags: review?(bryner) → review+
checked in
Status: ASSIGNED → RESOLVED
Last Resolved: 13 years ago
Resolution: --- → FIXED

Comment 110

13 years ago
I think we need to add the following codes 
  if (NS_FAILED(rv))
    return rv;
after 
    else if (rv != NS_ERROR_FILE_ACCESS_DENIED)
    {
        // If that failed for any reason other than NS_ERROR_FILE_ACCESS_DENIED,
        // assume we tried an NFS that does not support it. Now, try with symlink
        // using the old symlink path
        rv = LockWithSymlink(oldFilePath, PR_FALSE);
    }

if not, mHaveLock will be set to PR_TRUE even if current mozilla fail to get the
lock.
It's OK to set mHaveLock even if we didn't actually acquire the lock. For
example the Mac code path always set it.

In fact, in this case it would not be correct to skip setting mHaveLock there as
we might have taken the fcntl lock and had some kind of error placing the
symlink lock ... then if mHaveLock is PR_FALSE, the cleanup code would not
release the fcntl lock.

Updated

13 years ago
Blocks: 303633

Comment 112

13 years ago
*** Bug 297969 has been marked as a duplicate of this bug. ***

Comment 113

13 years ago
*** Bug 316203 has been marked as a duplicate of this bug. ***

Updated

10 years ago
Duplicate of this bug: 324637
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.