Closed
Bug 151188
Opened 22 years ago
Closed 19 years ago
Pid-signature on profile lock not unique (was: No option to ignore profile lock), results in "profile is in use" message after crash
Categories
(Core Graveyard :: Profile: BackEnd, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla1.8alpha3
People
(Reporter: BenB, Assigned: roc)
References
Details
(Whiteboard: [cb] no progress for 1.8b3?)
Attachments
(2 files, 3 obsolete files)
6.65 KB,
patch
|
Details | Diff | Splinter Review | |
12.66 KB,
patch
|
bryner
:
review+
brendan
:
superreview+
brendan
:
approval1.8b4+
|
Details | Diff | Splinter Review |
Reproduction:
1. Make it so that the profile is locked, but Mozilla is not running. That
sometimes happens during crashes (e.g. of the machine).
2. Start Mozilla, selecting that locked profile
Actual result:
Mozilla tells you that the profile is locked, because another instance of
Mozilla is using it. That is untrue (there is no other instance). There is no
way in the UI to ignore the lock and start Mozilla with that profile
nevertheless. Without going to the filesystem, you cannot use the profile
anymore at all - it is lost.
Expected result:
1. Mozilla checks, if there is actually another Mozilla instance running.
- If no other instance found
2. guess that the lock is a leftover from an abort.
3. Tell the user so, warning about potential profile corruption and warning
that there might indeed be another Mozilla instance that we just didn't
find, e.g. a Mozilla on another computer, which uses the same profile
on a network-shared folder.
4. Allow to stop/proceed, default proceed.
5. Remove lock, if proceed.
- If we another Mozilla instance found running
2. Pop up the dialog we have today
3. The dialog offers a way to ignore the lock, on the user's risk.
4. Default is to honor the lock and not to proceed.
I see that this is a bit of work. The most important part are the lalst 2 steps
- allowing a way to override the lock in the current dialog.
Workaround:
1. Remove the file "lock" in your profile.
Severity:
The should not be hard to implement, and it easy to work around, but normal
users will not find that workaround, which makes their emails and bookmarks
practically (almost) unaccessible, which is practically major dataloss for them.
Thus, severity critical, although it's surely arguable. In any case, I think
this should be fixed soon - by far worse than a crash IMO.
Additional Comments:
- Might be cross-platform, dunno.
- I surely appreciate that we seem to allow 2 instances of Mozilla with
different profiles at the same time.
Comment 1•22 years ago
|
||
AFAIK after a comment from ccarlen's we will not add an ignore.
(not as pref, hidden pref or something else)
You can file a bug that mozilla should detect that the profile is not in use.
(and not critical, since there is no dataloss, memory leak or crash)
Severity: critical → major
Comment 2•22 years ago
|
||
> Mozilla tells you that the profile is locked, because another instance of
> Mozilla is using it. That is untrue (there is no other instance).
this part is a bug, but I have never seen it (except for bug 145061). Mozilla
checks the pid the lock file claims to have and if it does not exist, ignores /
removes the lock file.
What you're seeing could easily happen over NFS.
see also bug 122698
Reporter | ||
Comment 3•22 years ago
|
||
I saw this at least twice since my machine locks up hard occasionally recently.
I use Mozilla 1.0-code.
No NFS or network shares involved. There definitely was no other Mozilla
instance, because the machine just booted.
> (and not critical, since there is no dataloss, memory leak or crash)
As I outlined above, this bug equals complete dataloss of the whole profile for
normal users without external help.
Severity: major → critical
Reporter | ||
Comment 4•22 years ago
|
||
->backend
(As we speak, it happened again.)
Assignee: ben → ccarlen
Component: Profile Manager FrontEnd → Profile Manager BackEnd
Comment 5•22 years ago
|
||
Sounds like a problem in the fix for bug 76431. We do check that the process
that acquired the lock is alive so crashing while holding the lock should not be
an issue - except perhaps if some other process gets the same process id next
time around. There are also signal handlers installed which should remove the
lock in the event of a crash. I need to check and see whether pid rollover could
be to blame.
Anyway, I don't think there should be an option to ignore the lock. The locking
code needs to avoid getting into this situation and be absolutely fail-safe.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.1beta
Comment 6•22 years ago
|
||
Brendan, check out comment #5. Does process id rollover seem likely? To avoid
this on the Mac, I record in the lock file both the process id *and* the time at
which the process was launched in order to prevent this.
Comment 7•22 years ago
|
||
Ben, when this problem occurs, could you check the lock file? Is it pointing to
a PID that exists?
Comment 8•22 years ago
|
||
I do not believe PID rollover can happen on modern Unixes.
Someone needs to dig a little more here, diagnose the problem, and not call for
a bogus solution (a way to ignore the lock).
/be
Reporter | ||
Comment 9•22 years ago
|
||
> could you check the lock file? Is it pointing to a PID that exists?
Yes, it was in at least 2 instances.
> I do not believe PID rollover can happen on modern Unixes.
It is even likely, if you consider the case in question (machine crash, reboot).
PIDs are assigned sequentially, so it is very likely that processes started
during the boot get the same PID as the last time. Extreme case: init is always
PID 1. Similarily, if the user always starts Mozilla directly after the login.
BTW: I notice that the lock is not a file, but a symlink to "<ip>:<pid>". Not
sure, if that's legal, but it's unusual - usually, PID files have the PID as the
(only) content (see /var/run/).
Comment 10•22 years ago
|
||
Win98, Mozilla 1.1a:
Mozilla crashed several times when trying to get Enigmail working (usually IPC
console freeze/crash) which eventually ended by killing the process (winoldap),
killing Mozilla simultaneously. After several of these types of crashes, I saw
this bug. When launching Mozilla, the profile manager comes up asking me to
select a profile (w/ only one profile to select from) and when I click "Start
Mozilla", I get the following Alert:
"Mozilla cannot use the profile "earthsound" because it is in use.
Please choose another profile or create a new one."
I have no other instance of Mozilla running. I will see if rebooting the box
will do any good and get back to this bug...
Comment 11•22 years ago
|
||
After a reboot, Mozilla launched incident free.
Comment 12•22 years ago
|
||
i'm nearly certain that i had pid rollover in FreeBSD 4.4 while i was at
worldgate. otoh my box was up for nearly my entire term there (no reboots, no
panics, no kernel updates, no power failures).
QNX seemed to use 10digit numbers for pids and whether it actually could roll
over would probably not matter since it didn't seem likely that randomness
would give you a reasonable chance to hit the same pid.
ignoring locks would be stupid, but having a capture/replace lock option might
be reasonable.
Comment 13•22 years ago
|
||
*** Bug 145363 has been marked as a duplicate of this bug. ***
Comment 14•22 years ago
|
||
*** Bug 157223 has been marked as a duplicate of this bug. ***
Comment 15•22 years ago
|
||
> I do not believe PID rollover can happen on modern Unixes.
As a note, just building mozilla a few times will lead to PID rollover on a
typical RedHat Linux install running on the Intel PC arch.... (yay all those
shell and make processes).
Comment 16•22 years ago
|
||
Are multiple instances of Mozilla on DIFFERENT machines sharing a single home
directory via NFS simply not allowed?
If so, the locking design needs to be seriously re-examined..
Comment 17•22 years ago
|
||
> Are multiple instances of Mozilla on DIFFERENT machines sharing a single home
> directory via NFS simply not allowed?
Absolutely. The bug I think you're looking for is bug 135137
Comment 18•22 years ago
|
||
*** Bug 163854 has been marked as a duplicate of this bug. ***
Comment 19•22 years ago
|
||
I still don't think any UI should let users override the lock. If we had to do
that, we'd be in a situation where non-experts would choose "override" out of
ignorance. Experts can rm. This bug should be about qualifying the pid-sig in
some backward compatible way.
Linux kernel helpers (shaver!), is there no way to get the guaranteed-unique, 32
bit pid from the system?
/be
Summary: No option to ignore profile lock → Pid-signature on profile lock not unique (was: No option to ignore profile lock)
Comment 20•22 years ago
|
||
> This bug should be about qualifying the pid-sig in some backward compatible way.
I agree completely. Can we just encode the launch time of the process which
acquired the lock into the string in addition to the PID? That way, we can parse
for two tokens, and if we find only one, it's an older app without this change
and we just have to believe the lock is valid. All we need is, given a PID, find
out the time at which that process was launched. Anybody know how to do this?
Reporter | ||
Comment 21•22 years ago
|
||
ccarlen, I am not sure what you are talking about. This bug is about the fact
that my system locked up and Mozilla left a lock in the profile. After reboot, I
tried to start Mozilla again and I couldn't, because the lock prevented it.
So, this has absolutely nothing to do with PID rollover (which means - I infer
from your discussion - the same PID for 2 different processes *during the same
OS session*). PID rollover could just be another, very unlikely, case where the
same result appears.
I don't think that my case is an "expert" situation. Computers can crash hard
for newbies as well. Consequently, we should not rely on expert knowledge to
recover from that situation.
Comment 22•22 years ago
|
||
> I am not sure what you are talking about
> encode the launch time of the process
in addition to recording the PID of the mozilla process, record the start time.
this would resolve PID rolover AND PID re-use after reboot.
> Anybody know how to do this?
I think this would be very platform-specific. On Linux, you could look in
/proc/(PID)
Comment 23•22 years ago
|
||
> So, this has absolutely nothing to do with PID rollover
I think it must. The only way that this could happen is that mozilla left a lock
file, the system crashed and was rebooted and, next time around when mozilla
checked the PID of the lock file, some other process now had that PID. Knowing
the time at which the process with that ID was launched would allow us to detect
this.
Reporter | ||
Comment 24•22 years ago
|
||
OK, and what do you do in that case? You see "There's a lock, and it's not from
me." What then?
Comment 25•22 years ago
|
||
If the launch time enocoded in the lock does not match the launch time of the
now running process with the PID which is encoded in the lock, you know the lock
is invalid and claim it.
Comment 26•22 years ago
|
||
*** Bug 156836 has been marked as a duplicate of this bug. ***
Comment 27•22 years ago
|
||
*** Bug 156293 has been marked as a duplicate of this bug. ***
Comment 28•22 years ago
|
||
*** Bug 183903 has been marked as a duplicate of this bug. ***
Comment 29•22 years ago
|
||
*** Bug 189891 has been marked as a duplicate of this bug. ***
Comment 30•22 years ago
|
||
Setting to milestone that's not passed.
Target Milestone: mozilla1.1beta → mozilla1.4beta
Comment 32•22 years ago
|
||
> Setting All/All per comment 26.
Yeah, but now that the change mentioned in that comment has been made, it's no
longer All/All.
OS: All → Linux
Hardware: All → PC
Comment 33•22 years ago
|
||
Mozilla 1.3b, Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3b)
Gecko/20030210, Windows 2000
I grabbed a fresh install of Mozilla 1.3b and ran the GRE installer while
Mozilla was running in it's first instance.
The GRE installer prompted me to shut down the running instance of mozilla,
which I agreed to, and the GRE installer proceeded to shut down the exisiting
instance after notifying me about Quicklaunch.
After GRE installation completed, I relaunched Mozilla and was greeted by the
profile manager. The Default profile was selected and clicked Start Mozilla.
Instead of Mozilla, I was surprised by an alert which read:
"Mozilla cannot use the profile "default" because it is in use.
Please choose another profile or create a new one."
Mozilla's Quicklaunch was not running when I tried the 2nd launch and a Mozilla
wasn't visible within the Windows environment, nor was Mozilla listed in Task
Manager>Applications. I could only start Mozilla after I killed the mozilla.exe
process still listed in Task Manager>Processes.
I will uninstall and see if I can repeat this. Would any patch applied toward
this bug touch upon this manifestation of this bug, or should I file a new one?
Reporter | ||
Comment 34•22 years ago
|
||
> I could only start Mozilla after I killed the mozilla.exe
> process still listed in Task Manager>Processes.
earthsound, what you saw is a different bug. In your case, there actually was a
Mozilla still running. In my case, there were none, but the lock was
nevertheless there.
Comment 35•22 years ago
|
||
*** Bug 193136 has been marked as a duplicate of this bug. ***
Comment 36•22 years ago
|
||
*** Bug 198255 has been marked as a duplicate of this bug. ***
Comment 37•22 years ago
|
||
*** Bug 203733 has been marked as a duplicate of this bug. ***
Comment 38•22 years ago
|
||
My bug (bug #203733) got closed as a dup of this one. I would encourage
the folks working on this bug to look at the comments I put there. I'm
not sure that it is truly a duplicate, as the comments in this bug report
seem to state that this is about fixing PID-related problems.
To summarize bug #203733: the fact that the IP address is recorded
seems to cause another variant of the PID-wrapping/PID-replay problem
documented here. This is my reading of the code at
http://lxr.mozilla.org/mozilla/ident?i=USE_SYMLINK_LOCKING:
2099 if (addr != inaddr.s_addr)
2100 {
2101 // Remote lock: give up even if stuck.
2102 break;
2103 }
Timestamps don't solve the networking problem I've outlined. Users
with profiles on NFS servers are a fact of life, and users migrating
between client nodes is a part of that. Even if you claim "no we don't
support that", DHCP is *also* a fact of life, and the existing scheme
doesn't seem to work on a host whose IP address changes frequently, as
is the case with many laptop users.
When the IP addresses conflict, I don't see how you can reasonably
fix this without a UI tweak. Perhaps you could only activate the UI
option if the code detects that the local and lock IP addresses are
different; this would alleviate some of the non-expert vs. expert fears.
More generally, I disagree with Brendan's comments about experts; case
in point: I got a call from my manager, who is not an "expert", but
who ran into this problem. It's not her fault. This bug then caused
me to spend time on figuring out and fixing the problem. How is that
a good user experience for either her or me? No one likes having to
call up the local nerd to fix his/her browser.
Thanks for working on this, I'm glad to see it being addressed.
Comment 39•22 years ago
|
||
1. ping - a proposal to handle ip address changes in a network environment.
when we examine the lock file and determine that the ip address doesn't match
the local computer, try pinging the referenced computer.
if it's down, we ass-u-me that the lock is bogus.
this could fail for a worldwide nfs server where someone uses a profile in one
place, and then moves to another corporation which blocks icmp. but i don't
think that would be very likely.
2. readonly access - an offtopic proposal to handle annoyed users
(this is yet another reason why it might be useful to support readonly access to
profiles.)
Instead of telling users that they can't use a profile when we can't get a lock,
tell the user that their changes will be lost, but let the user use the profile.
i believe n4 did this.
Comment 40•22 years ago
|
||
*** Bug 203832 has been marked as a duplicate of this bug. ***
Comment 41•22 years ago
|
||
timeless wrote:
> 1. ping - a proposal to handle ip address changes in a network environment.
> when we examine the lock file and determine that the ip address doesn't match
> the local computer, try pinging the referenced computer.
Erm... what will happen in the case of DHCP ? Just guess, please... =:-)
Comment 42•22 years ago
|
||
I'm sorry, I don't think the ping proposal is really viable; it certainly
doesn't solve the university case, where a student is moving about from
host to host in a computing cluster. In particular: computer A crashes
while student is running mozilla. So she moves to computer B and then
gets this message, since computer A still "holds" the lock.
At the very least, you could provide the user the hint:
This profile is locked by a copy of mozilla running on the remote
computer "foobar" (PID 12345). If you are SURE that it is not in
use, you may need to remove the file
"/home/<user>/.mozilla/5t1x4ov6.slt/lock" and restart mozilla.
Comment 43•21 years ago
|
||
*** Bug 215257 has been marked as a duplicate of this bug. ***
Comment 44•21 years ago
|
||
some thoughts:
1. Mozilla should still try to tell whether a lock is bogus or not even
knowing in some cases it couldn't at all. Once a lock is determined to
be bogus, it should be cleared without know of the user. ( In local host
case we could tell even a pid wrap happens)
2. Since mozilla never know whether a lock is bogus or not in the cases
of nfs mounted home directory and such case is very common in enterprise
environment, means should be provided to allow the user to delete the
lock if he is sure it is safe. If we don't do this, such users get very
bad user experience.
3. About what means provided to user. a message saying something like
" An instance of mozilla running on "Hostname", PID=12345, is using this
profile, if you are very sure it's not in use any more, you need to
remove the file /home/..../.mozilla/xxxxxx/lock" and restart mozilla".
Providing an button which directly clear the lock file may be dangerous,
allowing for many users is likely to press a button before reading
carefully the alarms and the severe results of running more than one
mozillla on same profile.
4. Is there a resolvement to know whether a lock is alive or not in
network environment? how about heart beat timestamp? although it has its own
problems too
Comment 45•21 years ago
|
||
*** Bug 225157 has been marked as a duplicate of this bug. ***
Comment 46•21 years ago
|
||
I would have really liked Dan's "Hint" message (comment 42) when I suffered this
problem yesterday (Linux hard crash, machine rebooted). Although the lock
symlink was broken, and therefore drawing a lot of attention in my colorful "ls
-al" command, I felt quite nervous about removing this file with no easily-found
documentation.
IMHO, Dan's suggested text has two problems:
(1) His message text states that it is a remote computer; this assumption may
not be valid. But, if DHCP has assigned a new address to the local machine in my
scenario, we can't use the address to confirm that it is local, either. Just
list the machine addr, don't pin it down either way.
(2) The phrase "you may need to" is too coy: If you are SURE that the profile
isn't in use, and that it is the one you want to use, then "you should" remove
the file.
If the Severity is "critical", I feel that we should at least provide
documentation of the work-around in the pop-up. Struggling to create code which
handles duplicate PIDs and DHCP addresses seems much less important to me. The
bug has been open for 15 months!
Please advise if a separate bug is approriate, although I'm not very experienced
with creating these things.
BTW, if this bug summary contained the words "in use" rather than "lock", I
would have found it. The current Profile Manager error message *DOESN'T CONTAIN*
the word "lock".
Comment 47•21 years ago
|
||
*** Bug 226598 has been marked as a duplicate of this bug. ***
Comment 48•21 years ago
|
||
*** Bug 228064 has been marked as a duplicate of this bug. ***
Comment 49•21 years ago
|
||
[ I am submitter of bug #228064 in which I described a desired behavior close to
comment #44 ]
Just adding a comment to insist on the fact that this issue is *critical* for a
beginner.
I posted this issue after someone sent me a mail saying :
"I'm writting to you because I have a BIG problem with Mozilla.
Yesterday, opening it, it asked me something like chose/configure the account to
use. [...] I don't know what I've done, but now, I can't use my "account" (my
old mails, addressbook, and even my bookmarks) anymore"
So, please
* Add a message explaining the workaround in the *next* version of Mozilla (This
is trivial, so, should be done *right now*).
* Then, it becomes less critical, but it would still be really good to offer an
option to delete the file. (keep in mind that ~/.mozilla is hidden, so,
unreachable for a newbie :-( ... )
Comment 50•21 years ago
|
||
I got a new idea, but not sure my assumptions are right.
I think heart beat is the only way to solve this problem.
1.A mozilla instance must periodly(say every 5 minutes) claims that it owns a
lock by touching lock file( and write some stuff in it if we need).
2. if a lock has not been claimed for 5minutes(read it and then compare its last
access time and last change time), the lock is thought to be bogus.
assumptions:
1. on a net fs environment, the access/visit time of file is always from the
net fs server.
2....
Comment 51•21 years ago
|
||
I just tried to move my mozilla profile between an old machine and my new one,
and got bitten by this. I simply copied /home/rjn/.mozilla from one to the
other, and for some reason, there was still a stale lock file in place.
I agree with #42 - the profile manager should give more information in the error
message, so that the user at least knows why there is an "erroneous error".
Comment 52•21 years ago
|
||
*** Bug 234199 has been marked as a duplicate of this bug. ***
Comment 53•21 years ago
|
||
*** Bug 235216 has been marked as a duplicate of this bug. ***
Comment 54•21 years ago
|
||
*** Bug 239615 has been marked as a duplicate of this bug. ***
Comment 55•21 years ago
|
||
The summary here is strange: "Pid-signature on profile lock not unique",
The problem described in this bug happens with unique locks as well.
I had a system crash yesterday with a 1.6 build running, and after reboot a
stale lock was left around. Mozilla did not want to start while the lock existed.
The PID number indicated by the lock was way higher than the PID would have been
had i managed to start Mozilla anew. In the previous session, where it crashed
and left the stale lock, Mozilla and several other processes had been started
and terminated for weeks on end. The PID of the lock was 27000-something.
Comment 56•21 years ago
|
||
(In reply to comment #55)
> I had a system crash yesterday with a 1.6 build running, and after reboot a
> stale lock was left around. Mozilla did not want to start while the lock existed.
> The PID number indicated by the lock was way higher than the PID would have been
> had i managed to start Mozilla anew. In the previous session, where it crashed
> and left the stale lock, Mozilla and several other processes had been started
> and terminated for weeks on end. The PID of the lock was 27000-something.
what confuses Mozilla is if the PID of the lock file corresponds to an existing
process (any existing process). If you had no existing process using the lock's
PID, then it's a different (and more serious) bug.
Reporter | ||
Comment 57•21 years ago
|
||
Andrew, rkaa said her PID file was for a PID 27000-something, and she had the
problem after reboot. I don't think Linux assigns a PID of 27000 right after
booting.
rkaa and the rest: See comment 9 and 21. My problem was (IIRC) that you can have
a stale, invalid lock pointing to the PID of your own and only Mozilla process,
and it's even a likely case.
Start computer
Start Mozilla right after boot (automatically or manually)
Machine crashes hard
Reboot
Start Mozilla right after boot (automatically or manually)
In that case, it's likely that you run into this bug. Mozilla will likely leave
a stale lock behind during the crash, and the Mozilla instance after reboot is
likely to get the exactly same PID as during the last boot. No need for "PID
rollover" or anything freaky or otherwise statistically unlikely.
And, this bug means massive dataloss for the average user. He/she can't get to
the profile, including stored emails, *at all* anymore. Mozilla will refuse to
use it, even after re-installing Mozilla. Only after searching the net on user
help sites will he/she ideally find the hint what to do. So, if the average user
who runs into this problem either says goodbye to *all* Mozilla data or is up
for major trouble and time-sinks.
And it would be so easy to fix...
Reporter | ||
Comment 58•21 years ago
|
||
Please disregard my last comment. While it's a valid problem on its own, I must
have run into other problems as well. Sorry, too late here.
Comment 59•21 years ago
|
||
I was going try implementing fcntl file locking (the "correct" way to do profile
locking), but the code is already there. It's just avoided in deference to old
NFS servers, or those not running lockd. The "old" NFS servers would be those
from 2.0 kernels (user-space NFS daemon). Early 2.2 kernels were apparently
less than optimal as well.
Has enough time passed that we can depend on people:
1) running on their local machine
2) running with newer NFS (we're two major kernel versions beyond "old" kernels)
Is it possible to detect such old nfs implementations and just use symlinks then?
No longer blocks: 189595
Comment 60•20 years ago
|
||
Please refer bug 253950 regarding a proposed change in the UI behaviour that
would slightly alleviate the impact of this issue in some cases. I submitted
that bug for Firefox but Mozilla's behaviour is analogous.
Comment 61•20 years ago
|
||
*** Bug 246077 has been marked as a duplicate of this bug. ***
Comment 62•20 years ago
|
||
*** Bug 208162 has been marked as a duplicate of this bug. ***
Comment 63•20 years ago
|
||
*** Bug 261106 has been marked as a duplicate of this bug. ***
Comment 64•20 years ago
|
||
*** Bug 264661 has been marked as a duplicate of this bug. ***
Comment 65•20 years ago
|
||
Tweaking summary to hopefully catch some dupes. Sorry for spam.
Summary: Pid-signature on profile lock not unique (was: No option to ignore profile lock) → Pid-signature on profile lock not unique (was: No option to ignore profile lock), results in "profile is in use" message after crash
Comment 66•20 years ago
|
||
*** Bug 257306 has been marked as a duplicate of this bug. ***
Comment 67•20 years ago
|
||
*** Bug 264793 has been marked as a duplicate of this bug. ***
Comment 68•20 years ago
|
||
it seems to me like the real problem was fixed in bug 265313, and was not caused
by a reused pid.
Comment 69•20 years ago
|
||
I don't think so. There's no indication that locking is broken per se. FWIW,
the lock is subsequently (a day or two) released.
Reporter | ||
Comment 70•20 years ago
|
||
No, biesi, I don't think so. This bug was observed in a case where
*******
the machine crashes (and so the lock did not get removed) and after a reboot of
the machine, the new Mozilla instance got exactly the same PID as before reboot
- several times.
*******
So, the PID in the lock file will point to a valid, running Mozilla (in fact the
checking Mozilla itself in this case, but not necessarily), but it's still not
the same process that created the lock file, the latter process crashed before
reboot.
Here's how it can happen (I already mentioned it in comment 9, but I'll try with
more words): Look at |ps ax|. Note that the processes (sorted by PIDs) appear in
the order they were started. The processes started automatically from initd
during boot will come first. If you write down their PIDs, reboot, do the same
again, you may find that they have exactly the same PIDs again, although you
rebooted in the meantime, so they are definitely different processes. Whether
the PIDs do happen to match exactly, may depend on your OS (kernel; how boot
scrips are implemented, e.g. if it runs a |rm| separately over each of the files
in /tmp/, whose number will vary). But *if* PIDs are assigned sequentially
without randomness (which I think is the case), and *if* exactly the same
processes get started in the same order after reboot (which is also not unlikely
due to initd and co), it's likely that your syslog, apache and kdm and maybe
even the GUI apps started by KDE session manager or another method all have the
same PIDs they had after your last boot. At this point you already have a
problem, if you have an old lock file lying around.
Now add to the equation a machine that has hardware problems, thus crashes hard
very often. This makes the problem more likely to appear. Mozilla will not have
had a chance to remove the lock file (nothing happened anymore, not even signal
handlers or the kernel), and you perform the same boot sequence a lot of times
(because of a lot of crashes in a row), *and* due to the kernel crashing with
the machine, you as a user *assume* that there is dataloss, so it's not all that
suspicious when Mozilla tells you that you can't use your profile anymore, so
you (as a by now terribly unnerved and unpatient end user) are more likely to
write it off and start a new one, although your profile might have been just fine.
This was my case (to my knowledge), so I am not constructing this. Granted, the
machine crash is a hazardous environment to live in, but this is an unnecessary
problem that Mozilla adds, and it can happen in other cases, too.
I'm sure I'm not the only one to start Mozilla immediately after boot. In this
case, the current *idea* of profile locking doesn't work as expected and needs
to be adapted.
Proposal:
If the PID in the lock file exists
- check that it's a Mozilla (don't check for process being called "mozilla",
because it may be called "netscape" or "beonex-comm" or whatever, but use other
methods, e.g. try to communicate with it). If it's not a Mozilla, ignore the
lock file (or allow to override it)
- check that it's not yourself (checking process). If it is, ignore the lock
file. (I'm assuming that every Mozilla process has at most one profile in use.)
Maybe there's a better fix, though, e.g.
Alternate Proposal:
- if the PID in the lock is running, try to communicate with that Mozilla and
ask it, which profile directory it's using currently. Only lock the profile, if
the other Mozilla tells you that it's using the same that you are trying to use.
If the communication fails, because it's not a Mozilla (or the Mozilla doesn't
react anymore), ignore the lock. If you get a such request (e.g. from yourself),
answer 'no profile'.
I still think that a user override (with warnings) is probably a good idea,
until the above gets implemented (this bug is now 2.5 years old) and in general
for other bugs, because no code is infallible. Do not assume that the user
searches on Google about how to remove the lock file.
Comment 71•20 years ago
|
||
simpler, just make sure the lock is younger than the pid. if the lock is older
than the pid, then it's bogus.
Comment 72•20 years ago
|
||
timeless, that's what I was talking about back in comment 20. I don't currently
have a Linux box to pursue that fix, but if somebody does, I still think it's
the way to go.
Comment 73•20 years ago
|
||
This adds a timestamp to the lock symlink's content, which can be compared to
boot time on systems with sysinfo(2) or something like it. Porting help
needed!
/be
Attachment #170712 -
Flags: superreview?(bryner)
Attachment #170712 -
Flags: review?(ccarlen)
Comment 74•20 years ago
|
||
Comment on attachment 170712 [details] [diff] [review]
proposed improvement
As-is, this will fail to compile on Mac, and probably some other systems as
well. We may want an autoconf test for sysinfo() if it does exist on systems
other than Linux. On Mac, I believe you'll want to use sysctl() with
CTL_KERN/KERN_BOOTTIME to get the system start time. Dunno if other systems
have that sysctl (BSD?)... if so, maybe we should have an autoconf test for
that as well.
Attachment #170712 -
Flags: superreview?(bryner) → superreview-
Comment 75•20 years ago
|
||
Comment on attachment 170712 [details] [diff] [review]
proposed improvement
Ok, I'll work with a Mac buddy (or steal the wife's laptop and install tools)
to get a Mac-tested patch here. Comments and porting interdiffs still welcome.
/be
Attachment #170712 -
Flags: superreview-
Attachment #170712 -
Flags: review?(ccarlen)
Comment 76•20 years ago
|
||
This ought to work on OS X -- testing welcome (haven't requisitioned the wife's
G4 powerbook yet! ;-).
/be
Assignee: ccarlen → brendan
Attachment #170723 -
Flags: superreview?(bryner)
Attachment #170723 -
Flags: review?(ccarlen)
Comment 77•20 years ago
|
||
Comment on attachment 170723 [details] [diff] [review]
proposed improvement, v2
>--- nsProfileLock.cpp 12 Nov 2004 19:26:37 -0000 1.7
>+++ nsProfileLock.cpp 9 Jan 2005 05:47:15 -0000
>@@ -49,16 +49,17 @@
> #endif
>
> #ifdef XP_UNIX
> #include <unistd.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <stdlib.h>
>+#include <sys/sysinfo.h>
This will fail to compile on Mac. I'm not really in agreement with the
assumption that XP_UNIX means that sysinfo will be available. This
functionality should be in NSPR or at the very least should be an autoconf
test.
>+#if defined(XP_MACOSX)
>+ int mib[2];
>+ size_t size;
>+ struct timeval boottv;
>+
>+ mib[0] = CTL_KERN;
>+ mib[1] = KERN_BOOTTIME;
>+ size = sizeof(boottime);
boottime is not declared anywhere. Assuming you meant to use 'boottv'.
This could be condensed some...
int mib[2] = { CTL_KERN, KERN_BOOTTIME };
struct timeval boottv;
size_t size = sizeof(boottv);
Attachment #170723 -
Flags: superreview?(bryner) → superreview-
Comment 78•20 years ago
|
||
Comment on attachment 170723 [details] [diff] [review]
proposed improvement, v2
Have to get to a Mac tomorrow and test things first. Also need some autoconf
101 help.
/be
Attachment #170723 -
Attachment is obsolete: true
Attachment #170723 -
Flags: review?(ccarlen)
Updated•20 years ago
|
Priority: -- → P1
Comment 79•20 years ago
|
||
Ok, I'm bogging. My patch needs autoconf love and I'm swamped with other,
higher priorities. Anyone willing to take this into the end zone?
/be
Comment 80•20 years ago
|
||
is there a reason not to implement comment 59?
Comment 81•20 years ago
|
||
Comment 59 wasn't a concrete implementation plan, unless you ignore the final
sentence: "Is it possible to detect such old nfs implementations and just use
symlinks then?"
If you have an answer to that question, lay it on us.
If you just wanna go for it and require working NFS locking, I'd love to see
that work too -- but it would be better to do that in an early alpha. Otherwise
who is to say that the number of users exposed to broken NFS locking won't be
too high for us to live with?
/be
Comment 82•20 years ago
|
||
actually, looking at the source again, the Mac implementation already falls back
to symlinks, specifically because of old NFS. The fcntl man page says it should
return ENOLCK if it hits an old NFS implementation (consistent with
http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_02/msg00150.html), which
would mean the current use in Mozilla would work properly and fall back to
symlinks. The comment in the source says:
http://lxr.mozilla.org/seamonkey/source/profile/dirserviceprovider/src/nsProfileLock.cpp#548
// First, try the 4.x-compatible symlink technique, which works with NFS without
// depending on (broken or missing, too often) lockd.
There seem to be 3 reasons stated there: NS4 compatibility, missing lockd and
broken lockd. I don't see NS4 compatibility as relevant anymore and missing
lockd should trigger a fallback to symlinks. The lockd would need to be not
only old, but broken (claiming to support remote locking, but not).
Comment 83•20 years ago
|
||
Ok, how about a patch to use the XP_MAXOSX code everywhere? We'd still need
compile-time tests for fcntl and its alternatives.
/be
Comment 84•20 years ago
|
||
I have this working, but with one problem:
If an newer (fcntl'ing) build of Mozilla is running and an older (symlinking)
build of Mozilla tries to use the same profile, it will succeed in creating the
symlink lock file and start. As currently implemented, the symlink gets a
different filename than the fcntl filename, but even if they were the same, the
symlinking method just deletes anything that's not a symlink.
I can add code to each method to not clobber the other type of lock, but that
won't prevent older builds from doing it. I guess it could try to get both
types of lock file, but that seems a bit overboard.
Reporter | ||
Comment 85•20 years ago
|
||
Just FYI, another, far simpler and more obvious way to run into this bug (I just
did): After hard reboot, I have a
$ ls -la lock
lrwxrwxrwx 1 browser users 16 2005-01-25 15:21 lock -> 192.168.3.1:3522
$ ps ax|grep 3522
3522 ? S 0:00 kdeinit: kio_file file /mnt/ram/tmp/ksocket-ben/klaunch
(192.168.3.1 is my IP address.) This prevented Mozilla 1.7.x from using the
profile. So, no need for complicated situations like in my comment 70 or PID
rollover to reproduce this.
I don't know what "fcntl file locking" is, so I can't say, if it's important.
Updated•20 years ago
|
Flags: blocking1.8b2?
Flags: blocking-aviary1.1?
Comment 86•20 years ago
|
||
Updated•20 years ago
|
Attachment #174189 -
Flags: review?(brendan)
Comment 87•20 years ago
|
||
If you enable fcntl based locking (which I think is a good idea), then you also
need to write out the symlink file for backwards compat. We must do this so
long as both new and old versions of our apps locate the profile in the same way.
Comment 88•20 years ago
|
||
right (see comment 84). But if we still create a symlink file, it will still
get left behind and this bug will still exist.
Is it really likely for people to have two different versions of the same app
installed and try to use them simultaneously?
Comment 89•20 years ago
|
||
> right (see comment 84). But if we still create a symlink file, it will still
> get left behind and this bug will still exist.
Yes. That is why we need to have better UI when that problem occurs. We need
to tell the user that Firefox appears to already be running, and then we need to
give them the option to continue anyways (asking them to verify that Firefox is
really not already running before proceeding). Showing the "select profile"
dialog is just awful UI.
> Is it really likely for people to have two different versions of the same app
> installed and try to use them simultaneously?
Yes. Suppose someone has Firefox 1.0 installed on their system using their
distro's RPMS, and then suppose they also downloaded a fresh new copy of Firefox
1.1 from Mozilla.org. Now, suppose for whatever reason Firefox 1.1 is not quite
setup with their system properly, so clicking http:// links in other products
like GAIM causes the system version of Firefox to run. If they had Firefox 1.1
running at the time, they'd see a Firefox 1.0 window popup, and maybe they
wouldn't think anything of it. They might use it, and in the process they may
end up trashing their profile. This scenario doesn't seem like an edge case to
me. (NOTE: we had this problem when we first started doing symlink locking for
Mozilla 1.0. Netscape 6 would trash the Mozilla 1.0 profile IIRC.)
Comment 90•20 years ago
|
||
one more thing: if symlink locking is a failover for when fnctl based locking
fails, then we need to solve the symlink problem anyways, so I don't see
supporting symlink locking for backwards compatibility as being much more work.
Reporter | ||
Comment 91•20 years ago
|
||
> clicking http:// links in other products
> like GAIM causes the system version of Firefox to run
Shouldn't the firefox starter binary find a running instance of Firefox,
regardless of version (but regarding app), so starting Firefox 1.0 should just
open a new window in the running Firefox 1.1?
Comment 92•20 years ago
|
||
yes, good call. i forgot about the fact that X-remote is now properly
integrated into the firefox shell script! ;-)
hmm... that may indeed be a convenient way around the problem i was describing.
can anyone think of a way that X-remote might be defeated?
Comment 93•20 years ago
|
||
It's not the binary that looks for running instances but the startup script. If
you pass any arguments to the script (like -inspector, --sync or -P) it will not
try to do X-remote. The startup script ought to handle some of them, but it
doesn't.
Comment 94•20 years ago
|
||
and Seamonkey script, which doesn't try to do X-remote at all (such intuitive
behavior is undesired)
Comment 95•20 years ago
|
||
(In reply to comment #87)
> If you enable fcntl based locking (which I think is a good idea), then you also
> need to write out the symlink file for backwards compat.
I have a suggestion for how to implement this: create the symlink (and remove it
on exit) but give the file it points to a new name, e.g., fcntl.lock.
Then, if the symlink already exists and points to fcntl.lock you know the other
instance supports fcntl locking and can use it to determine whether the lock is
still valid. If the symlink points to another file, the other instance is an
old version and you need to revert to the old behaviour.
This way you maintain backwards compatibility but still get the improved
behaviour once all instances have the new code.
Harry.
Updated•20 years ago
|
Flags: blocking-aviary1.1? → blocking-aviary1.1+
Updated•20 years ago
|
Flags: blocking1.8b2? → blocking1.8b3+
Target Milestone: mozilla1.4beta → mozilla1.8alpha3
Comment 96•19 years ago
|
||
Reassigning to Josh per Deer Park meeting. Thanks Josh!
Assignee: brendan → joshmoz
Status: ASSIGNED → NEW
Comment 97•19 years ago
|
||
seawood, can you take a look at this with us?
Updated•19 years ago
|
Whiteboard: [cb] no progress for 1.8b3?
Updated•19 years ago
|
Flags: blocking1.8b3+ → blocking1.8b3-
Comment 98•19 years ago
|
||
*** Bug 301529 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 99•19 years ago
|
||
Darin's backward-compatiblity argument doesn't apply if the RPM distributor is
the one who applies this patch :-).
Assignee | ||
Comment 100•19 years ago
|
||
A variant of comment #95 would work well, I think. If the link is to something
like "192.168.199.3:7777:obsolete" then today's Firefox builds would treat it
exactly the same --- a valid lock, if pid 7777 exists or the IP address is
different. But future Firefox builds could recognize that this was not the real
lock, that the process that created it made the real lock using fcntl, and if
the fcntl lock is acquired then the symlink is a stale lock and can be deleted.
Comment 101•19 years ago
|
||
http://lxr.mozilla.org/seamonkey/source/profile/dirserviceprovider/src/nsProfileLock.cpp#301
301 char *after = nsnull;
302 pid_t pid = strtol(colon, &after, 0);
303 if (pid != 0 && *after == '\0')
with "192.168.199.3:7777:obsolete", |*after| will be ':' (the second colon), the
link will be assumed to be bogus and deleted. Any such flag would need to be
more subtle (perhaps setting the date to 1970).
Assignee | ||
Comment 102•19 years ago
|
||
Good point.
We could make the obsolete flag be a "+" in front of the PID. That would be
ignored by strtol.
OS: Linux → FreeBSD
Assignee | ||
Comment 103•19 years ago
|
||
This patch does that. It works.
Assignee: nobody → roc
Attachment #174189 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #190038 -
Flags: superreview?(brendan)
Attachment #190038 -
Flags: review?(bryner)
Assignee | ||
Updated•19 years ago
|
Attachment #190038 -
Flags: superreview?(brendan)
Attachment #190038 -
Flags: review?(bryner)
Assignee | ||
Comment 104•19 years ago
|
||
Sorry, the previous patch was the wrong patch. This is the right patch.
Attachment #190038 -
Attachment is obsolete: true
Attachment #190043 -
Flags: superreview?(brendan)
Attachment #190043 -
Flags: review?(bryner)
Assignee | ||
Updated•19 years ago
|
Attachment #174189 -
Flags: review?(brendan)
Comment 105•19 years ago
|
||
Comment on attachment 190043 [details] [diff] [review]
real fix
As Basil Fawlty would say, BRILLIANT!
Just lose your wacky Java-ish style ;-) and sr+a=me, contingent on bryner's r+.
>-nsresult nsProfileLock::LockWithSymlink(const nsACString& lockFilePath)
>+static PRBool IsSymlinkStaleLock(struct in_addr* aAddr, const char* aFileName,
>+ PRBool aHaveFcntlLock) {
Wacky Java-ish style nit-picks: no need for static, and put the { on its own
line in column 1.
> nsresult LockWithFcntl(const nsACString& lockFilePath);
>- nsresult LockWithSymlink(const nsACString& lockFilePath);
>+ /**
>+ * @param aHaveFcntlLock if true, we've already acquired an fcntl lock so this
>+ * lock is merely an "obsolete" lock to keep out old Firefoxes
>+ */
How about a blank line before this doc-comment?
So the remaining concern for Unixen was the fear, going back to 1995, that some
lockd implementations hang when you try fcntl -- but such gross bugs must have
been fixed by now (right?). So I'm not concerned about trying fcntl first.
Randell Jesup, speak up if you are still concerned.
Great work, thanks for cutting the Gordian Knot here.
/be
Attachment #190043 -
Flags: superreview?(brendan)
Attachment #190043 -
Flags: superreview+
Attachment #190043 -
Flags: approval1.8b4+
Assignee | ||
Comment 106•19 years ago
|
||
(In reply to comment #105)
> >-nsresult nsProfileLock::LockWithSymlink(const nsACString& lockFilePath)
> >+static PRBool IsSymlinkStaleLock(struct in_addr* aAddr, const char* aFileName,
> >+ PRBool aHaveFcntlLock) {
>
> Wacky Java-ish style nit-picks: no need for static, and put the { on its own
> line in column 1.
'static' does have an effect. A good compiler would notice that there's exactly
one use of this static function and forcibly inline the whole thing. I think gcc
even does that.
> How about a blank line before this doc-comment?
Sure.
Comment 107•19 years ago
|
||
*** Bug 301941 has been marked as a duplicate of this bug. ***
Comment 108•19 years ago
|
||
Comment on attachment 190043 [details] [diff] [review]
real fix
Looks good. Re: the comment about checking whether the running process is
really the right app, it would be great if we could somehow get the window id
for the process and then query it with xremote. But that's for another day.
Attachment #190043 -
Flags: review?(bryner) → review+
Assignee | ||
Comment 109•19 years ago
|
||
checked in
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment 110•19 years ago
|
||
I think we need to add the following codes
if (NS_FAILED(rv))
return rv;
after
else if (rv != NS_ERROR_FILE_ACCESS_DENIED)
{
// If that failed for any reason other than NS_ERROR_FILE_ACCESS_DENIED,
// assume we tried an NFS that does not support it. Now, try with symlink
// using the old symlink path
rv = LockWithSymlink(oldFilePath, PR_FALSE);
}
if not, mHaveLock will be set to PR_TRUE even if current mozilla fail to get the
lock.
Assignee | ||
Comment 111•19 years ago
|
||
It's OK to set mHaveLock even if we didn't actually acquire the lock. For
example the Mac code path always set it.
In fact, in this case it would not be correct to skip setting mHaveLock there as
we might have taken the fcntl lock and had some kind of error placing the
symlink lock ... then if mHaveLock is PR_FALSE, the cleanup code would not
release the fcntl lock.
Comment 112•19 years ago
|
||
*** Bug 297969 has been marked as a duplicate of this bug. ***
Comment 113•19 years ago
|
||
*** Bug 316203 has been marked as a duplicate of this bug. ***
Updated•9 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•