Closed Bug 151188 Opened 22 years ago Closed 19 years ago

Pid-signature on profile lock not unique (was: No option to ignore profile lock), results in "profile is in use" message after crash

Categories

(Core Graveyard :: Profile: BackEnd, defect, P1)

x86
FreeBSD
defect

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.8alpha3

People

(Reporter: BenB, Assigned: roc)

References

Details

(Whiteboard: [cb] no progress for 1.8b3?)

Attachments

(2 files, 3 obsolete files)

Reproduction: 1. Make it so that the profile is locked, but Mozilla is not running. That sometimes happens during crashes (e.g. of the machine). 2. Start Mozilla, selecting that locked profile Actual result: Mozilla tells you that the profile is locked, because another instance of Mozilla is using it. That is untrue (there is no other instance). There is no way in the UI to ignore the lock and start Mozilla with that profile nevertheless. Without going to the filesystem, you cannot use the profile anymore at all - it is lost. Expected result: 1. Mozilla checks, if there is actually another Mozilla instance running. - If no other instance found 2. guess that the lock is a leftover from an abort. 3. Tell the user so, warning about potential profile corruption and warning that there might indeed be another Mozilla instance that we just didn't find, e.g. a Mozilla on another computer, which uses the same profile on a network-shared folder. 4. Allow to stop/proceed, default proceed. 5. Remove lock, if proceed. - If we another Mozilla instance found running 2. Pop up the dialog we have today 3. The dialog offers a way to ignore the lock, on the user's risk. 4. Default is to honor the lock and not to proceed. I see that this is a bit of work. The most important part are the lalst 2 steps - allowing a way to override the lock in the current dialog. Workaround: 1. Remove the file "lock" in your profile. Severity: The should not be hard to implement, and it easy to work around, but normal users will not find that workaround, which makes their emails and bookmarks practically (almost) unaccessible, which is practically major dataloss for them. Thus, severity critical, although it's surely arguable. In any case, I think this should be fixed soon - by far worse than a crash IMO. Additional Comments: - Might be cross-platform, dunno. - I surely appreciate that we seem to allow 2 instances of Mozilla with different profiles at the same time.
AFAIK after a comment from ccarlen's we will not add an ignore. (not as pref, hidden pref or something else) You can file a bug that mozilla should detect that the profile is not in use. (and not critical, since there is no dataloss, memory leak or crash)
Severity: critical → major
> Mozilla tells you that the profile is locked, because another instance of > Mozilla is using it. That is untrue (there is no other instance). this part is a bug, but I have never seen it (except for bug 145061). Mozilla checks the pid the lock file claims to have and if it does not exist, ignores / removes the lock file. What you're seeing could easily happen over NFS. see also bug 122698
I saw this at least twice since my machine locks up hard occasionally recently. I use Mozilla 1.0-code. No NFS or network shares involved. There definitely was no other Mozilla instance, because the machine just booted. > (and not critical, since there is no dataloss, memory leak or crash) As I outlined above, this bug equals complete dataloss of the whole profile for normal users without external help.
Severity: major → critical
->backend (As we speak, it happened again.)
Assignee: ben → ccarlen
Component: Profile Manager FrontEnd → Profile Manager BackEnd
Sounds like a problem in the fix for bug 76431. We do check that the process that acquired the lock is alive so crashing while holding the lock should not be an issue - except perhaps if some other process gets the same process id next time around. There are also signal handlers installed which should remove the lock in the event of a crash. I need to check and see whether pid rollover could be to blame. Anyway, I don't think there should be an option to ignore the lock. The locking code needs to avoid getting into this situation and be absolutely fail-safe.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.1beta
Brendan, check out comment #5. Does process id rollover seem likely? To avoid this on the Mac, I record in the lock file both the process id *and* the time at which the process was launched in order to prevent this.
Ben, when this problem occurs, could you check the lock file? Is it pointing to a PID that exists?
I do not believe PID rollover can happen on modern Unixes. Someone needs to dig a little more here, diagnose the problem, and not call for a bogus solution (a way to ignore the lock). /be
> could you check the lock file? Is it pointing to a PID that exists? Yes, it was in at least 2 instances. > I do not believe PID rollover can happen on modern Unixes. It is even likely, if you consider the case in question (machine crash, reboot). PIDs are assigned sequentially, so it is very likely that processes started during the boot get the same PID as the last time. Extreme case: init is always PID 1. Similarily, if the user always starts Mozilla directly after the login. BTW: I notice that the lock is not a file, but a symlink to "<ip>:<pid>". Not sure, if that's legal, but it's unusual - usually, PID files have the PID as the (only) content (see /var/run/).
Win98, Mozilla 1.1a: Mozilla crashed several times when trying to get Enigmail working (usually IPC console freeze/crash) which eventually ended by killing the process (winoldap), killing Mozilla simultaneously. After several of these types of crashes, I saw this bug. When launching Mozilla, the profile manager comes up asking me to select a profile (w/ only one profile to select from) and when I click "Start Mozilla", I get the following Alert: "Mozilla cannot use the profile "earthsound" because it is in use. Please choose another profile or create a new one." I have no other instance of Mozilla running. I will see if rebooting the box will do any good and get back to this bug...
After a reboot, Mozilla launched incident free.
i'm nearly certain that i had pid rollover in FreeBSD 4.4 while i was at worldgate. otoh my box was up for nearly my entire term there (no reboots, no panics, no kernel updates, no power failures). QNX seemed to use 10digit numbers for pids and whether it actually could roll over would probably not matter since it didn't seem likely that randomness would give you a reasonable chance to hit the same pid. ignoring locks would be stupid, but having a capture/replace lock option might be reasonable.
*** Bug 145363 has been marked as a duplicate of this bug. ***
*** Bug 157223 has been marked as a duplicate of this bug. ***
> I do not believe PID rollover can happen on modern Unixes. As a note, just building mozilla a few times will lead to PID rollover on a typical RedHat Linux install running on the Intel PC arch.... (yay all those shell and make processes).
Are multiple instances of Mozilla on DIFFERENT machines sharing a single home directory via NFS simply not allowed? If so, the locking design needs to be seriously re-examined..
> Are multiple instances of Mozilla on DIFFERENT machines sharing a single home > directory via NFS simply not allowed? Absolutely. The bug I think you're looking for is bug 135137
*** Bug 163854 has been marked as a duplicate of this bug. ***
I still don't think any UI should let users override the lock. If we had to do that, we'd be in a situation where non-experts would choose "override" out of ignorance. Experts can rm. This bug should be about qualifying the pid-sig in some backward compatible way. Linux kernel helpers (shaver!), is there no way to get the guaranteed-unique, 32 bit pid from the system? /be
Summary: No option to ignore profile lock → Pid-signature on profile lock not unique (was: No option to ignore profile lock)
> This bug should be about qualifying the pid-sig in some backward compatible way. I agree completely. Can we just encode the launch time of the process which acquired the lock into the string in addition to the PID? That way, we can parse for two tokens, and if we find only one, it's an older app without this change and we just have to believe the lock is valid. All we need is, given a PID, find out the time at which that process was launched. Anybody know how to do this?
ccarlen, I am not sure what you are talking about. This bug is about the fact that my system locked up and Mozilla left a lock in the profile. After reboot, I tried to start Mozilla again and I couldn't, because the lock prevented it. So, this has absolutely nothing to do with PID rollover (which means - I infer from your discussion - the same PID for 2 different processes *during the same OS session*). PID rollover could just be another, very unlikely, case where the same result appears. I don't think that my case is an "expert" situation. Computers can crash hard for newbies as well. Consequently, we should not rely on expert knowledge to recover from that situation.
> I am not sure what you are talking about > encode the launch time of the process in addition to recording the PID of the mozilla process, record the start time. this would resolve PID rolover AND PID re-use after reboot. > Anybody know how to do this? I think this would be very platform-specific. On Linux, you could look in /proc/(PID)
> So, this has absolutely nothing to do with PID rollover I think it must. The only way that this could happen is that mozilla left a lock file, the system crashed and was rebooted and, next time around when mozilla checked the PID of the lock file, some other process now had that PID. Knowing the time at which the process with that ID was launched would allow us to detect this.
OK, and what do you do in that case? You see "There's a lock, and it's not from me." What then?
If the launch time enocoded in the lock does not match the launch time of the now running process with the PID which is encoded in the lock, you know the lock is invalid and claim it.
*** Bug 156836 has been marked as a duplicate of this bug. ***
*** Bug 156293 has been marked as a duplicate of this bug. ***
*** Bug 183903 has been marked as a duplicate of this bug. ***
*** Bug 189891 has been marked as a duplicate of this bug. ***
Setting to milestone that's not passed.
Target Milestone: mozilla1.1beta → mozilla1.4beta
Setting All/All per comment 26.
OS: Linux → All
Hardware: PC → All
> Setting All/All per comment 26. Yeah, but now that the change mentioned in that comment has been made, it's no longer All/All.
OS: All → Linux
Hardware: All → PC
Mozilla 1.3b, Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3b) Gecko/20030210, Windows 2000 I grabbed a fresh install of Mozilla 1.3b and ran the GRE installer while Mozilla was running in it's first instance. The GRE installer prompted me to shut down the running instance of mozilla, which I agreed to, and the GRE installer proceeded to shut down the exisiting instance after notifying me about Quicklaunch. After GRE installation completed, I relaunched Mozilla and was greeted by the profile manager. The Default profile was selected and clicked Start Mozilla. Instead of Mozilla, I was surprised by an alert which read: "Mozilla cannot use the profile "default" because it is in use. Please choose another profile or create a new one." Mozilla's Quicklaunch was not running when I tried the 2nd launch and a Mozilla wasn't visible within the Windows environment, nor was Mozilla listed in Task Manager>Applications. I could only start Mozilla after I killed the mozilla.exe process still listed in Task Manager>Processes. I will uninstall and see if I can repeat this. Would any patch applied toward this bug touch upon this manifestation of this bug, or should I file a new one?
> I could only start Mozilla after I killed the mozilla.exe > process still listed in Task Manager>Processes. earthsound, what you saw is a different bug. In your case, there actually was a Mozilla still running. In my case, there were none, but the lock was nevertheless there.
*** Bug 193136 has been marked as a duplicate of this bug. ***
*** Bug 198255 has been marked as a duplicate of this bug. ***
*** Bug 203733 has been marked as a duplicate of this bug. ***
My bug (bug #203733) got closed as a dup of this one. I would encourage the folks working on this bug to look at the comments I put there. I'm not sure that it is truly a duplicate, as the comments in this bug report seem to state that this is about fixing PID-related problems. To summarize bug #203733: the fact that the IP address is recorded seems to cause another variant of the PID-wrapping/PID-replay problem documented here. This is my reading of the code at http://lxr.mozilla.org/mozilla/ident?i=USE_SYMLINK_LOCKING: 2099 if (addr != inaddr.s_addr) 2100 { 2101 // Remote lock: give up even if stuck. 2102 break; 2103 } Timestamps don't solve the networking problem I've outlined. Users with profiles on NFS servers are a fact of life, and users migrating between client nodes is a part of that. Even if you claim "no we don't support that", DHCP is *also* a fact of life, and the existing scheme doesn't seem to work on a host whose IP address changes frequently, as is the case with many laptop users. When the IP addresses conflict, I don't see how you can reasonably fix this without a UI tweak. Perhaps you could only activate the UI option if the code detects that the local and lock IP addresses are different; this would alleviate some of the non-expert vs. expert fears. More generally, I disagree with Brendan's comments about experts; case in point: I got a call from my manager, who is not an "expert", but who ran into this problem. It's not her fault. This bug then caused me to spend time on figuring out and fixing the problem. How is that a good user experience for either her or me? No one likes having to call up the local nerd to fix his/her browser. Thanks for working on this, I'm glad to see it being addressed.
1. ping - a proposal to handle ip address changes in a network environment. when we examine the lock file and determine that the ip address doesn't match the local computer, try pinging the referenced computer. if it's down, we ass-u-me that the lock is bogus. this could fail for a worldwide nfs server where someone uses a profile in one place, and then moves to another corporation which blocks icmp. but i don't think that would be very likely. 2. readonly access - an offtopic proposal to handle annoyed users (this is yet another reason why it might be useful to support readonly access to profiles.) Instead of telling users that they can't use a profile when we can't get a lock, tell the user that their changes will be lost, but let the user use the profile. i believe n4 did this.
*** Bug 203832 has been marked as a duplicate of this bug. ***
timeless wrote: > 1. ping - a proposal to handle ip address changes in a network environment. > when we examine the lock file and determine that the ip address doesn't match > the local computer, try pinging the referenced computer. Erm... what will happen in the case of DHCP ? Just guess, please... =:-)
I'm sorry, I don't think the ping proposal is really viable; it certainly doesn't solve the university case, where a student is moving about from host to host in a computing cluster. In particular: computer A crashes while student is running mozilla. So she moves to computer B and then gets this message, since computer A still "holds" the lock. At the very least, you could provide the user the hint: This profile is locked by a copy of mozilla running on the remote computer "foobar" (PID 12345). If you are SURE that it is not in use, you may need to remove the file "/home/<user>/.mozilla/5t1x4ov6.slt/lock" and restart mozilla.
*** Bug 215257 has been marked as a duplicate of this bug. ***
some thoughts: 1. Mozilla should still try to tell whether a lock is bogus or not even knowing in some cases it couldn't at all. Once a lock is determined to be bogus, it should be cleared without know of the user. ( In local host case we could tell even a pid wrap happens) 2. Since mozilla never know whether a lock is bogus or not in the cases of nfs mounted home directory and such case is very common in enterprise environment, means should be provided to allow the user to delete the lock if he is sure it is safe. If we don't do this, such users get very bad user experience. 3. About what means provided to user. a message saying something like " An instance of mozilla running on "Hostname", PID=12345, is using this profile, if you are very sure it's not in use any more, you need to remove the file /home/..../.mozilla/xxxxxx/lock" and restart mozilla". Providing an button which directly clear the lock file may be dangerous, allowing for many users is likely to press a button before reading carefully the alarms and the severe results of running more than one mozillla on same profile. 4. Is there a resolvement to know whether a lock is alive or not in network environment? how about heart beat timestamp? although it has its own problems too
Blocks: 189595
*** Bug 225157 has been marked as a duplicate of this bug. ***
I would have really liked Dan's "Hint" message (comment 42) when I suffered this problem yesterday (Linux hard crash, machine rebooted). Although the lock symlink was broken, and therefore drawing a lot of attention in my colorful "ls -al" command, I felt quite nervous about removing this file with no easily-found documentation. IMHO, Dan's suggested text has two problems: (1) His message text states that it is a remote computer; this assumption may not be valid. But, if DHCP has assigned a new address to the local machine in my scenario, we can't use the address to confirm that it is local, either. Just list the machine addr, don't pin it down either way. (2) The phrase "you may need to" is too coy: If you are SURE that the profile isn't in use, and that it is the one you want to use, then "you should" remove the file. If the Severity is "critical", I feel that we should at least provide documentation of the work-around in the pop-up. Struggling to create code which handles duplicate PIDs and DHCP addresses seems much less important to me. The bug has been open for 15 months! Please advise if a separate bug is approriate, although I'm not very experienced with creating these things. BTW, if this bug summary contained the words "in use" rather than "lock", I would have found it. The current Profile Manager error message *DOESN'T CONTAIN* the word "lock".
*** Bug 226598 has been marked as a duplicate of this bug. ***
*** Bug 228064 has been marked as a duplicate of this bug. ***
[ I am submitter of bug #228064 in which I described a desired behavior close to comment #44 ] Just adding a comment to insist on the fact that this issue is *critical* for a beginner. I posted this issue after someone sent me a mail saying : "I'm writting to you because I have a BIG problem with Mozilla. Yesterday, opening it, it asked me something like chose/configure the account to use. [...] I don't know what I've done, but now, I can't use my "account" (my old mails, addressbook, and even my bookmarks) anymore" So, please * Add a message explaining the workaround in the *next* version of Mozilla (This is trivial, so, should be done *right now*). * Then, it becomes less critical, but it would still be really good to offer an option to delete the file. (keep in mind that ~/.mozilla is hidden, so, unreachable for a newbie :-( ... )
I got a new idea, but not sure my assumptions are right. I think heart beat is the only way to solve this problem. 1.A mozilla instance must periodly(say every 5 minutes) claims that it owns a lock by touching lock file( and write some stuff in it if we need). 2. if a lock has not been claimed for 5minutes(read it and then compare its last access time and last change time), the lock is thought to be bogus. assumptions: 1. on a net fs environment, the access/visit time of file is always from the net fs server. 2....
I just tried to move my mozilla profile between an old machine and my new one, and got bitten by this. I simply copied /home/rjn/.mozilla from one to the other, and for some reason, there was still a stale lock file in place. I agree with #42 - the profile manager should give more information in the error message, so that the user at least knows why there is an "erroneous error".
*** Bug 234199 has been marked as a duplicate of this bug. ***
*** Bug 235216 has been marked as a duplicate of this bug. ***
*** Bug 239615 has been marked as a duplicate of this bug. ***
The summary here is strange: "Pid-signature on profile lock not unique", The problem described in this bug happens with unique locks as well. I had a system crash yesterday with a 1.6 build running, and after reboot a stale lock was left around. Mozilla did not want to start while the lock existed. The PID number indicated by the lock was way higher than the PID would have been had i managed to start Mozilla anew. In the previous session, where it crashed and left the stale lock, Mozilla and several other processes had been started and terminated for weeks on end. The PID of the lock was 27000-something.
(In reply to comment #55) > I had a system crash yesterday with a 1.6 build running, and after reboot a > stale lock was left around. Mozilla did not want to start while the lock existed. > The PID number indicated by the lock was way higher than the PID would have been > had i managed to start Mozilla anew. In the previous session, where it crashed > and left the stale lock, Mozilla and several other processes had been started > and terminated for weeks on end. The PID of the lock was 27000-something. what confuses Mozilla is if the PID of the lock file corresponds to an existing process (any existing process). If you had no existing process using the lock's PID, then it's a different (and more serious) bug.
Andrew, rkaa said her PID file was for a PID 27000-something, and she had the problem after reboot. I don't think Linux assigns a PID of 27000 right after booting. rkaa and the rest: See comment 9 and 21. My problem was (IIRC) that you can have a stale, invalid lock pointing to the PID of your own and only Mozilla process, and it's even a likely case. Start computer Start Mozilla right after boot (automatically or manually) Machine crashes hard Reboot Start Mozilla right after boot (automatically or manually) In that case, it's likely that you run into this bug. Mozilla will likely leave a stale lock behind during the crash, and the Mozilla instance after reboot is likely to get the exactly same PID as during the last boot. No need for "PID rollover" or anything freaky or otherwise statistically unlikely. And, this bug means massive dataloss for the average user. He/she can't get to the profile, including stored emails, *at all* anymore. Mozilla will refuse to use it, even after re-installing Mozilla. Only after searching the net on user help sites will he/she ideally find the hint what to do. So, if the average user who runs into this problem either says goodbye to *all* Mozilla data or is up for major trouble and time-sinks. And it would be so easy to fix...
Please disregard my last comment. While it's a valid problem on its own, I must have run into other problems as well. Sorry, too late here.
I was going try implementing fcntl file locking (the "correct" way to do profile locking), but the code is already there. It's just avoided in deference to old NFS servers, or those not running lockd. The "old" NFS servers would be those from 2.0 kernels (user-space NFS daemon). Early 2.2 kernels were apparently less than optimal as well. Has enough time passed that we can depend on people: 1) running on their local machine 2) running with newer NFS (we're two major kernel versions beyond "old" kernels) Is it possible to detect such old nfs implementations and just use symlinks then?
No longer blocks: 189595
Please refer bug 253950 regarding a proposed change in the UI behaviour that would slightly alleviate the impact of this issue in some cases. I submitted that bug for Firefox but Mozilla's behaviour is analogous.
*** Bug 246077 has been marked as a duplicate of this bug. ***
*** Bug 208162 has been marked as a duplicate of this bug. ***
*** Bug 261106 has been marked as a duplicate of this bug. ***
*** Bug 264661 has been marked as a duplicate of this bug. ***
Tweaking summary to hopefully catch some dupes. Sorry for spam.
Summary: Pid-signature on profile lock not unique (was: No option to ignore profile lock) → Pid-signature on profile lock not unique (was: No option to ignore profile lock), results in "profile is in use" message after crash
*** Bug 257306 has been marked as a duplicate of this bug. ***
*** Bug 264793 has been marked as a duplicate of this bug. ***
it seems to me like the real problem was fixed in bug 265313, and was not caused by a reused pid.
I don't think so. There's no indication that locking is broken per se. FWIW, the lock is subsequently (a day or two) released.
No, biesi, I don't think so. This bug was observed in a case where ******* the machine crashes (and so the lock did not get removed) and after a reboot of the machine, the new Mozilla instance got exactly the same PID as before reboot - several times. ******* So, the PID in the lock file will point to a valid, running Mozilla (in fact the checking Mozilla itself in this case, but not necessarily), but it's still not the same process that created the lock file, the latter process crashed before reboot. Here's how it can happen (I already mentioned it in comment 9, but I'll try with more words): Look at |ps ax|. Note that the processes (sorted by PIDs) appear in the order they were started. The processes started automatically from initd during boot will come first. If you write down their PIDs, reboot, do the same again, you may find that they have exactly the same PIDs again, although you rebooted in the meantime, so they are definitely different processes. Whether the PIDs do happen to match exactly, may depend on your OS (kernel; how boot scrips are implemented, e.g. if it runs a |rm| separately over each of the files in /tmp/, whose number will vary). But *if* PIDs are assigned sequentially without randomness (which I think is the case), and *if* exactly the same processes get started in the same order after reboot (which is also not unlikely due to initd and co), it's likely that your syslog, apache and kdm and maybe even the GUI apps started by KDE session manager or another method all have the same PIDs they had after your last boot. At this point you already have a problem, if you have an old lock file lying around. Now add to the equation a machine that has hardware problems, thus crashes hard very often. This makes the problem more likely to appear. Mozilla will not have had a chance to remove the lock file (nothing happened anymore, not even signal handlers or the kernel), and you perform the same boot sequence a lot of times (because of a lot of crashes in a row), *and* due to the kernel crashing with the machine, you as a user *assume* that there is dataloss, so it's not all that suspicious when Mozilla tells you that you can't use your profile anymore, so you (as a by now terribly unnerved and unpatient end user) are more likely to write it off and start a new one, although your profile might have been just fine. This was my case (to my knowledge), so I am not constructing this. Granted, the machine crash is a hazardous environment to live in, but this is an unnecessary problem that Mozilla adds, and it can happen in other cases, too. I'm sure I'm not the only one to start Mozilla immediately after boot. In this case, the current *idea* of profile locking doesn't work as expected and needs to be adapted. Proposal: If the PID in the lock file exists - check that it's a Mozilla (don't check for process being called "mozilla", because it may be called "netscape" or "beonex-comm" or whatever, but use other methods, e.g. try to communicate with it). If it's not a Mozilla, ignore the lock file (or allow to override it) - check that it's not yourself (checking process). If it is, ignore the lock file. (I'm assuming that every Mozilla process has at most one profile in use.) Maybe there's a better fix, though, e.g. Alternate Proposal: - if the PID in the lock is running, try to communicate with that Mozilla and ask it, which profile directory it's using currently. Only lock the profile, if the other Mozilla tells you that it's using the same that you are trying to use. If the communication fails, because it's not a Mozilla (or the Mozilla doesn't react anymore), ignore the lock. If you get a such request (e.g. from yourself), answer 'no profile'. I still think that a user override (with warnings) is probably a good idea, until the above gets implemented (this bug is now 2.5 years old) and in general for other bugs, because no code is infallible. Do not assume that the user searches on Google about how to remove the lock file.
simpler, just make sure the lock is younger than the pid. if the lock is older than the pid, then it's bogus.
timeless, that's what I was talking about back in comment 20. I don't currently have a Linux box to pursue that fix, but if somebody does, I still think it's the way to go.
This adds a timestamp to the lock symlink's content, which can be compared to boot time on systems with sysinfo(2) or something like it. Porting help needed! /be
Attachment #170712 - Flags: superreview?(bryner)
Attachment #170712 - Flags: review?(ccarlen)
Comment on attachment 170712 [details] [diff] [review] proposed improvement As-is, this will fail to compile on Mac, and probably some other systems as well. We may want an autoconf test for sysinfo() if it does exist on systems other than Linux. On Mac, I believe you'll want to use sysctl() with CTL_KERN/KERN_BOOTTIME to get the system start time. Dunno if other systems have that sysctl (BSD?)... if so, maybe we should have an autoconf test for that as well.
Attachment #170712 - Flags: superreview?(bryner) → superreview-
Comment on attachment 170712 [details] [diff] [review] proposed improvement Ok, I'll work with a Mac buddy (or steal the wife's laptop and install tools) to get a Mac-tested patch here. Comments and porting interdiffs still welcome. /be
Attachment #170712 - Flags: superreview-
Attachment #170712 - Flags: review?(ccarlen)
Attached patch proposed improvement, v2 (obsolete) — Splinter Review
This ought to work on OS X -- testing welcome (haven't requisitioned the wife's G4 powerbook yet! ;-). /be
Assignee: ccarlen → brendan
Attachment #170723 - Flags: superreview?(bryner)
Attachment #170723 - Flags: review?(ccarlen)
Comment on attachment 170723 [details] [diff] [review] proposed improvement, v2 >--- nsProfileLock.cpp 12 Nov 2004 19:26:37 -0000 1.7 >+++ nsProfileLock.cpp 9 Jan 2005 05:47:15 -0000 >@@ -49,16 +49,17 @@ > #endif > > #ifdef XP_UNIX > #include <unistd.h> > #include <fcntl.h> > #include <errno.h> > #include <signal.h> > #include <stdlib.h> >+#include <sys/sysinfo.h> This will fail to compile on Mac. I'm not really in agreement with the assumption that XP_UNIX means that sysinfo will be available. This functionality should be in NSPR or at the very least should be an autoconf test. >+#if defined(XP_MACOSX) >+ int mib[2]; >+ size_t size; >+ struct timeval boottv; >+ >+ mib[0] = CTL_KERN; >+ mib[1] = KERN_BOOTTIME; >+ size = sizeof(boottime); boottime is not declared anywhere. Assuming you meant to use 'boottv'. This could be condensed some... int mib[2] = { CTL_KERN, KERN_BOOTTIME }; struct timeval boottv; size_t size = sizeof(boottv);
Attachment #170723 - Flags: superreview?(bryner) → superreview-
Comment on attachment 170723 [details] [diff] [review] proposed improvement, v2 Have to get to a Mac tomorrow and test things first. Also need some autoconf 101 help. /be
Attachment #170723 - Attachment is obsolete: true
Attachment #170723 - Flags: review?(ccarlen)
Priority: -- → P1
Ok, I'm bogging. My patch needs autoconf love and I'm swamped with other, higher priorities. Anyone willing to take this into the end zone? /be
is there a reason not to implement comment 59?
Comment 59 wasn't a concrete implementation plan, unless you ignore the final sentence: "Is it possible to detect such old nfs implementations and just use symlinks then?" If you have an answer to that question, lay it on us. If you just wanna go for it and require working NFS locking, I'd love to see that work too -- but it would be better to do that in an early alpha. Otherwise who is to say that the number of users exposed to broken NFS locking won't be too high for us to live with? /be
actually, looking at the source again, the Mac implementation already falls back to symlinks, specifically because of old NFS. The fcntl man page says it should return ENOLCK if it hits an old NFS implementation (consistent with http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_02/msg00150.html), which would mean the current use in Mozilla would work properly and fall back to symlinks. The comment in the source says: http://lxr.mozilla.org/seamonkey/source/profile/dirserviceprovider/src/nsProfileLock.cpp#548 // First, try the 4.x-compatible symlink technique, which works with NFS without // depending on (broken or missing, too often) lockd. There seem to be 3 reasons stated there: NS4 compatibility, missing lockd and broken lockd. I don't see NS4 compatibility as relevant anymore and missing lockd should trigger a fallback to symlinks. The lockd would need to be not only old, but broken (claiming to support remote locking, but not).
Ok, how about a patch to use the XP_MAXOSX code everywhere? We'd still need compile-time tests for fcntl and its alternatives. /be
I have this working, but with one problem: If an newer (fcntl'ing) build of Mozilla is running and an older (symlinking) build of Mozilla tries to use the same profile, it will succeed in creating the symlink lock file and start. As currently implemented, the symlink gets a different filename than the fcntl filename, but even if they were the same, the symlinking method just deletes anything that's not a symlink. I can add code to each method to not clobber the other type of lock, but that won't prevent older builds from doing it. I guess it could try to get both types of lock file, but that seems a bit overboard.
Just FYI, another, far simpler and more obvious way to run into this bug (I just did): After hard reboot, I have a $ ls -la lock lrwxrwxrwx 1 browser users 16 2005-01-25 15:21 lock -> 192.168.3.1:3522 $ ps ax|grep 3522 3522 ? S 0:00 kdeinit: kio_file file /mnt/ram/tmp/ksocket-ben/klaunch (192.168.3.1 is my IP address.) This prevented Mozilla 1.7.x from using the profile. So, no need for complicated situations like in my comment 70 or PID rollover to reproduce this. I don't know what "fcntl file locking" is, so I can't say, if it's important.
Flags: blocking1.8b2?
Flags: blocking-aviary1.1?
Attachment #174189 - Flags: review?(brendan)
If you enable fcntl based locking (which I think is a good idea), then you also need to write out the symlink file for backwards compat. We must do this so long as both new and old versions of our apps locate the profile in the same way.
right (see comment 84). But if we still create a symlink file, it will still get left behind and this bug will still exist. Is it really likely for people to have two different versions of the same app installed and try to use them simultaneously?
> right (see comment 84). But if we still create a symlink file, it will still > get left behind and this bug will still exist. Yes. That is why we need to have better UI when that problem occurs. We need to tell the user that Firefox appears to already be running, and then we need to give them the option to continue anyways (asking them to verify that Firefox is really not already running before proceeding). Showing the "select profile" dialog is just awful UI. > Is it really likely for people to have two different versions of the same app > installed and try to use them simultaneously? Yes. Suppose someone has Firefox 1.0 installed on their system using their distro's RPMS, and then suppose they also downloaded a fresh new copy of Firefox 1.1 from Mozilla.org. Now, suppose for whatever reason Firefox 1.1 is not quite setup with their system properly, so clicking http:// links in other products like GAIM causes the system version of Firefox to run. If they had Firefox 1.1 running at the time, they'd see a Firefox 1.0 window popup, and maybe they wouldn't think anything of it. They might use it, and in the process they may end up trashing their profile. This scenario doesn't seem like an edge case to me. (NOTE: we had this problem when we first started doing symlink locking for Mozilla 1.0. Netscape 6 would trash the Mozilla 1.0 profile IIRC.)
one more thing: if symlink locking is a failover for when fnctl based locking fails, then we need to solve the symlink problem anyways, so I don't see supporting symlink locking for backwards compatibility as being much more work.
> clicking http:// links in other products > like GAIM causes the system version of Firefox to run Shouldn't the firefox starter binary find a running instance of Firefox, regardless of version (but regarding app), so starting Firefox 1.0 should just open a new window in the running Firefox 1.1?
yes, good call. i forgot about the fact that X-remote is now properly integrated into the firefox shell script! ;-) hmm... that may indeed be a convenient way around the problem i was describing. can anyone think of a way that X-remote might be defeated?
It's not the binary that looks for running instances but the startup script. If you pass any arguments to the script (like -inspector, --sync or -P) it will not try to do X-remote. The startup script ought to handle some of them, but it doesn't.
and Seamonkey script, which doesn't try to do X-remote at all (such intuitive behavior is undesired)
(In reply to comment #87) > If you enable fcntl based locking (which I think is a good idea), then you also > need to write out the symlink file for backwards compat. I have a suggestion for how to implement this: create the symlink (and remove it on exit) but give the file it points to a new name, e.g., fcntl.lock. Then, if the symlink already exists and points to fcntl.lock you know the other instance supports fcntl locking and can use it to determine whether the lock is still valid. If the symlink points to another file, the other instance is an old version and you need to revert to the old behaviour. This way you maintain backwards compatibility but still get the improved behaviour once all instances have the new code. Harry.
Flags: blocking-aviary1.1? → blocking-aviary1.1+
Flags: blocking1.8b2? → blocking1.8b3+
Target Milestone: mozilla1.4beta → mozilla1.8alpha3
Reassigning to Josh per Deer Park meeting. Thanks Josh!
Assignee: brendan → joshmoz
Status: ASSIGNED → NEW
seawood, can you take a look at this with us?
Whiteboard: [cb] no progress for 1.8b3?
Flags: blocking1.8b3+ → blocking1.8b3-
*** Bug 301529 has been marked as a duplicate of this bug. ***
Darin's backward-compatiblity argument doesn't apply if the RPM distributor is the one who applies this patch :-).
A variant of comment #95 would work well, I think. If the link is to something like "192.168.199.3:7777:obsolete" then today's Firefox builds would treat it exactly the same --- a valid lock, if pid 7777 exists or the IP address is different. But future Firefox builds could recognize that this was not the real lock, that the process that created it made the real lock using fcntl, and if the fcntl lock is acquired then the symlink is a stale lock and can be deleted.
http://lxr.mozilla.org/seamonkey/source/profile/dirserviceprovider/src/nsProfileLock.cpp#301 301 char *after = nsnull; 302 pid_t pid = strtol(colon, &after, 0); 303 if (pid != 0 && *after == '\0') with "192.168.199.3:7777:obsolete", |*after| will be ':' (the second colon), the link will be assumed to be bogus and deleted. Any such flag would need to be more subtle (perhaps setting the date to 1970).
Assignee: joshmoz → nobody
Good point. We could make the obsolete flag be a "+" in front of the PID. That would be ignored by strtol.
OS: Linux → FreeBSD
Attached patch fix (obsolete) — Splinter Review
This patch does that. It works.
Assignee: nobody → roc
Attachment #174189 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #190038 - Flags: superreview?(brendan)
Attachment #190038 - Flags: review?(bryner)
Attachment #190038 - Flags: superreview?(brendan)
Attachment #190038 - Flags: review?(bryner)
Attached patch real fixSplinter Review
Sorry, the previous patch was the wrong patch. This is the right patch.
Attachment #190038 - Attachment is obsolete: true
Attachment #190043 - Flags: superreview?(brendan)
Attachment #190043 - Flags: review?(bryner)
Comment on attachment 190043 [details] [diff] [review] real fix As Basil Fawlty would say, BRILLIANT! Just lose your wacky Java-ish style ;-) and sr+a=me, contingent on bryner's r+. >-nsresult nsProfileLock::LockWithSymlink(const nsACString& lockFilePath) >+static PRBool IsSymlinkStaleLock(struct in_addr* aAddr, const char* aFileName, >+ PRBool aHaveFcntlLock) { Wacky Java-ish style nit-picks: no need for static, and put the { on its own line in column 1. > nsresult LockWithFcntl(const nsACString& lockFilePath); >- nsresult LockWithSymlink(const nsACString& lockFilePath); >+ /** >+ * @param aHaveFcntlLock if true, we've already acquired an fcntl lock so this >+ * lock is merely an "obsolete" lock to keep out old Firefoxes >+ */ How about a blank line before this doc-comment? So the remaining concern for Unixen was the fear, going back to 1995, that some lockd implementations hang when you try fcntl -- but such gross bugs must have been fixed by now (right?). So I'm not concerned about trying fcntl first. Randell Jesup, speak up if you are still concerned. Great work, thanks for cutting the Gordian Knot here. /be
Attachment #190043 - Flags: superreview?(brendan)
Attachment #190043 - Flags: superreview+
Attachment #190043 - Flags: approval1.8b4+
(In reply to comment #105) > >-nsresult nsProfileLock::LockWithSymlink(const nsACString& lockFilePath) > >+static PRBool IsSymlinkStaleLock(struct in_addr* aAddr, const char* aFileName, > >+ PRBool aHaveFcntlLock) { > > Wacky Java-ish style nit-picks: no need for static, and put the { on its own > line in column 1. 'static' does have an effect. A good compiler would notice that there's exactly one use of this static function and forcibly inline the whole thing. I think gcc even does that. > How about a blank line before this doc-comment? Sure.
*** Bug 301941 has been marked as a duplicate of this bug. ***
Comment on attachment 190043 [details] [diff] [review] real fix Looks good. Re: the comment about checking whether the running process is really the right app, it would be great if we could somehow get the window id for the process and then query it with xremote. But that's for another day.
Attachment #190043 - Flags: review?(bryner) → review+
checked in
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
I think we need to add the following codes if (NS_FAILED(rv)) return rv; after else if (rv != NS_ERROR_FILE_ACCESS_DENIED) { // If that failed for any reason other than NS_ERROR_FILE_ACCESS_DENIED, // assume we tried an NFS that does not support it. Now, try with symlink // using the old symlink path rv = LockWithSymlink(oldFilePath, PR_FALSE); } if not, mHaveLock will be set to PR_TRUE even if current mozilla fail to get the lock.
It's OK to set mHaveLock even if we didn't actually acquire the lock. For example the Mac code path always set it. In fact, in this case it would not be correct to skip setting mHaveLock there as we might have taken the fcntl lock and had some kind of error placing the symlink lock ... then if mHaveLock is PR_FALSE, the cleanup code would not release the fcntl lock.
Blocks: 303633
*** Bug 297969 has been marked as a duplicate of this bug. ***
*** Bug 316203 has been marked as a duplicate of this bug. ***
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: