Closed
Bug 156836
Opened 22 years ago
Closed 22 years ago
Use flock() for profile locking
Categories
(Camino Graveyard :: General, defect)
Tracking
(Not tracked)
VERIFIED
DUPLICATE
of bug 151188
Chimera0.4
People
(Reporter: sfraser_bugs, Assigned: ccarlen)
Details
Attachments
(1 file)
Chimera (and perhaps mozilla?) create a lock file (or symlink) in the profile directory to indicate that the profile is in use. The problem with this is that the lock can remain in place if the program crashes, meaning that subsequent attempts to run the program erroneously report that it's already running. We should use flock() instead, since the kernel cleans these up when the process dies.
Comment 1•22 years ago
|
||
cc'ing brendan and bryner since i recall them talking about profile locking and NFS volumes recently on irc.
Assignee | ||
Comment 2•22 years ago
|
||
Yes. I suggested this recently. There is code in place that uses fcntl with F_SETLK. It has the same advantages as flock. See http://lxr.mozilla.org/mozilla/source/profile/src/nsProfileAccess.cpp#1959. The symlink code was decided as the first course of action because of F_SETLK not being available on all NFS servers. Read from http://bugzilla.mozilla.org/show_bug.cgi?id=76431#c126 on down a few comments. The symlink code that Brendan did sets signal handlers which should clear the locks on a crash. How this is failing is strange indeed. Is this failing? Do you have a way to repro it? Still, the fcntl code is more failsafe and it's also faster. The symlink code has to spend time looking up the IP address of the current process, etc. The time to make this change (with a simple #ifdef XP_MACOSX to head straight to the fcntl method) would have been before the preview was released. Locks made with one method are not visible to the other. If we're willing to accept that and change to use the fcntl locking scheme before we send more versions out the door, let's do it - it's easy. We would have to make the disclaimer that, if you're using a Chimera before this change was made and a build with this change, you run the risk of profile corruption.
Status: NEW → ASSIGNED
Component: Profile Manager BackEnd → General
Product: Browser → Chimera
Target Milestone: --- → Chimera0.4
Version: other → unspecified
Comment 3•22 years ago
|
||
Broken or missing lockd is a function of the NFS server, not the client, so any OSX client running with such a server is going to hate flock/fcntl-based profile locking. Why is OSX any different from other Unixes here? Simon, please see bug 76431, which points to the long history of the symlink-locking code. Didn't bryner and shaver have a way to get the primary IP address without using DNS? Was that in some other bug, or this one? /be
Comment 4•22 years ago
|
||
pink pointed out that this bug doesn't just want to use flock for the dozens of milliseconds of speedup it might give, and for the malfunction fun with NFS home directories it would give -- it seems to want to use flock to paper over some problem with the symlink locking code not detecting a stale lock and removing it. Is that so? As ccarlen points out, there is signal-handling code to remove a stale lock. But there is also code to invalidate a symlink lock if its pid identifies no live process. Can someone who sees a stuck symlink lock not being invalidated on subsequent startup ls -l it, and then ps -p that pid, and say whether there is a process with that id? /be
Assignee | ||
Comment 5•22 years ago
|
||
BTW, as long the symlink locking code has been in place, I have seen only 1 bug about locks being left behind by crashes - and that was because, IIRC, the process ID was reused when the person rebooted. That's my fear about the symlink method - if a local process with a certain process ID holds the lock and dies, how are we absolutely sure that another process, next time around, doesn't get that same process ID?
Comment 6•22 years ago
|
||
ccarlen: we can't be sure across a reboot, but we can be sure if only Mozilla or Chimera crashes, and the OS stays up, that the pid likely won't be reused (modern Unixes use 32 bit pits; even at thousands of forks per second, which is far from the mean, you'll have millions of seconds till wraparound). For the reboot case, we could use a system boot generation number, if there were one. Or in any case, if we found via kill(pid, 0) that the pid seems to identify a live process, perhaps we could tell that it was not Mozilla. If only OSX had /proc support. More in a bit. /be
Comment 7•22 years ago
|
||
Help me test this. Tell me what to do for OSX (pink sent me the source to top.c, but I could use <sys/sysctl.h> too). /be
Comment 8•22 years ago
|
||
Maybe adding boot-time in a compatible way to the symlink signature is better. /proc/uptime on Linux, what on OSX? /be
Comment 9•22 years ago
|
||
I can confirm that the stale lock detection isn't working on Chimera/2002082905 and OS X 6C115. The pid in the symlink doesn't exist yet Chimera refuses to start. Does anyone know if the NSDistributedLock class behaves correctly on an application crash? If it does, it might be the way to go for Chimera.
Comment 10•22 years ago
|
||
I'm curious to learn where the pid-signature checking code fails on OSX -- anyone debug that far? /be
Assignee | ||
Comment 11•22 years ago
|
||
> The pid in the symlink doesn't exist yet Chimera refuses to start. Are you sure about that? I tried to repro this by killing the program in these three ways: (1) force-quit (2) powering down machine when Chimera was running, and (3) inserting code which would cause a crash. In all 3 cases, the stale lock detection worked. The only way I could make it fail was to do #3, restart, start enough processes so the next would be Chimera, then start Chimera. Then, of course, it fails. That's the recurring pid problem (bug 151188) The other cases which worked properly show that the pid check is working. Maybe there's a better way to repro this?
Comment 12•22 years ago
|
||
Okay, you got me -- I can't reproduce it now. But it *did* fail for me in a nasty way today; Chimera crashed and couldn't be restarted immediately after this (I didn't reboot or do other things to roll the pids over). Well, blame it on the sun spots...
Comment 13•22 years ago
|
||
Re-reading this bug, I don't see why it should stay open. Comment #4 discloses info from pink (via IRC) suggesting that the summary here (which dictates a solution without stating the problem being solved) is inaccurate. If the solution is "cope with pid recycling", then this bug is a dup of bug 151188. I'll leave it to ccarlen to mark it as such, unless there's new evidence or a better statement of the (different from pid recycling) problem to be solved. /be
Assignee | ||
Comment 14•22 years ago
|
||
The only problem with the current locking scheme which is proven to exist is bug 151188. Using flock() is not reasonable because of its limitations with some NFS servers.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
Reporter | ||
Comment 16•22 years ago
|
||
I don't believe that PID recycling is the only issue here. I think the lock file is not cleaned up for certain kinds of application crashes.
Comment 17•22 years ago
|
||
If anyone has evidence of fatal signal handling bugs in the profile locking code, please file them. Of course, SIGKILL is uncatchable, so the pid lock can remain after at least some kinds of program terminations. So what? The profile locking code, once we fix bug 151188, is robust in the face of a left-over lock file. /be
Assignee | ||
Comment 18•22 years ago
|
||
I don't think PID recurrance after reboot is the only problem we face. The other problem is this: 1) The machine crashes and is rebooted. 2) The machine is set up to get its IP address from DHCP and after the reboot, it gets a different IP. 3) Next time around, the lock-checking code will think that lock is by a non-local process and be forced to give up. Unlike PID recurrance, I don't see a solution to this. For XP_MACOSX, I'm thinking that the benefits of the fcntl method begin to outweigh its not working on some NFS servers.
Comment 19•22 years ago
|
||
One solution would be to use an address that is not likely to change on reboot or as frequently as a DHCP lease. Options include: - The link-local IPv6 address of the first interface (Jag only) - md5 hash of the computer name The changing IP issue is solvable for non-system-crash situations. Chimera should register with the Kernel Event Monitor and the Preferences Monitor to receive notifications of changing network conditions: - The System proxy settings should be reloaded when a change occurs (worth a separate bug, perhaps). - On receiving a kernel notification about an interface change, Chimera would still know both the old and the new address, enabling it to find the old pid file and to rewrite it with the new info.
Comment 20•22 years ago
|
||
Okay, I can now replicate my problem with the profile locking. I use my PowerBook both at work and at home, using the location manager to switch between configurations. Each location has an unique IP address, but since I put my PowerBook to sleep between the places instead of shutting it down, a single Chimera instance can survive the trip. If I start up Chimera at home and then crash it at work, the stale lock file contains my home IP address and goes unrecognized. Chimera is forced to give up. The browser needs to listen to interface changes and rewrite the lock file accordingly.
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Assignee | ||
Comment 21•22 years ago
|
||
This was done (using fcntl) by bug 176608 for Chimera and Mach-0 mozilla so it's either fixed or invalid there. The remaining problem, for other platforms, is a dup of bug 151188. Closing as a dup so a link to this bug and its patch ;-) will show up there. *** This bug has been marked as a duplicate of 151188 ***
Status: REOPENED → RESOLVED
Closed: 22 years ago → 22 years ago
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•