Closed Bug 156836 Opened 22 years ago Closed 22 years ago

Use flock() for profile locking

Categories

(Camino Graveyard :: General, defect)

PowerPC
macOS
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED DUPLICATE of bug 151188
Chimera0.4

People

(Reporter: sfraser_bugs, Assigned: ccarlen)

Details

Attachments

(1 file)

Chimera (and perhaps mozilla?) create a lock file (or symlink) in the profile
directory to indicate that the profile is in use. The problem with this is that
the lock can remain in place if the program crashes, meaning that subsequent
attempts to run the program erroneously report that it's already running.

We should use flock() instead, since the kernel cleans these up when the process
dies.
cc'ing brendan and bryner since i recall them talking about profile locking and
NFS volumes recently on irc.
Yes. I suggested this recently. There is code in place that uses fcntl with
F_SETLK. It has the same advantages as flock. See
http://lxr.mozilla.org/mozilla/source/profile/src/nsProfileAccess.cpp#1959.

The symlink code was decided as the first course of action because of F_SETLK
not being available on all NFS servers. Read from
http://bugzilla.mozilla.org/show_bug.cgi?id=76431#c126 on down a few comments.
The symlink code that Brendan did sets signal handlers which should clear the
locks on a crash. How this is failing is strange indeed. Is this failing? Do you
have a way to repro it?

Still, the fcntl code is more failsafe and it's also faster. The symlink code
has to spend time looking up the IP address of the current process, etc. The
time to make this change (with a simple #ifdef XP_MACOSX to head straight to the
fcntl method) would have been before the preview was released. Locks made with
one method are not visible to the other.

If we're willing to accept that and change to use the fcntl locking scheme
before we send more versions out the door, let's do it - it's easy. We would
have to make the disclaimer that, if you're using a Chimera before this change
was made and a build with this change, you run the risk of profile corruption.
Status: NEW → ASSIGNED
Component: Profile Manager BackEnd → General
Product: Browser → Chimera
Target Milestone: --- → Chimera0.4
Version: other → unspecified
Broken or missing lockd is a function of the NFS server, not the client, so any
OSX client running with such a server is going to hate flock/fcntl-based profile
locking.  Why is OSX any different from other Unixes here?  Simon, please see
bug 76431, which points to the long history of the symlink-locking code.

Didn't bryner and shaver have a way to get the primary IP address without using
DNS?  Was that in some other bug, or this one?

/be
pink pointed out that this bug doesn't just want to use flock for the dozens of
milliseconds of speedup it might give, and for the malfunction fun with NFS home
directories it would give -- it seems to want to use flock to paper over some
problem with the symlink locking code not detecting a stale lock and removing
it.  Is that so?

As ccarlen points out, there is signal-handling code to remove a stale lock. 
But there is also code to invalidate a symlink lock if its pid identifies no
live process.  Can someone who sees a stuck symlink lock not being invalidated
on subsequent startup ls -l it, and then ps -p that pid, and say whether there
is a process with that id?

/be
BTW, as long the symlink locking code has been in place, I have seen only 1 bug
about locks being left behind by crashes - and that was because, IIRC,  the
process ID was reused when the person rebooted. That's my fear about the symlink
method - if a local process with a certain process ID holds the lock and dies,
how are we absolutely sure that another process, next time around, doesn't get
that same process ID?
ccarlen: we can't be sure across a reboot, but we can be sure if only Mozilla or
Chimera crashes, and the OS stays up, that the pid likely won't be reused
(modern Unixes use 32 bit pits; even at thousands of forks per second, which is
far from the mean, you'll have millions of seconds till wraparound).

For the reboot case, we could use a system boot generation number, if there were
one.  Or in any case, if we found via kill(pid, 0) that the pid seems to
identify a live process, perhaps we could tell that it was not Mozilla.  If only
OSX had /proc support.

More in a bit.

/be
Help me test this.  Tell me what to do for OSX (pink sent me the source to
top.c, but I could use <sys/sysctl.h> too).

/be
Maybe adding boot-time in a compatible way to the symlink signature is better. 
/proc/uptime on Linux, what on OSX?

/be
I can confirm that the stale lock detection isn't working on Chimera/2002082905
and OS X 6C115. The pid in the symlink doesn't exist yet Chimera refuses to start.

Does anyone know if the NSDistributedLock class behaves correctly on an
application crash? If it does, it might be the way to go for Chimera.
I'm curious to learn where the pid-signature checking code fails on OSX --
anyone debug that far?

/be
> The pid in the symlink doesn't exist yet Chimera refuses to start.

Are you sure about that?

I tried to repro this by killing the program in these three ways: (1) force-quit
(2) powering down machine when Chimera was running, and (3) inserting code which
would cause a crash.

In all 3 cases, the stale lock detection worked. The only way I could make it
fail was to do #3, restart, start enough processes so the next would be Chimera,
then start Chimera. Then, of course, it fails. That's the recurring pid problem
(bug 151188) The other cases which worked properly show that the pid check is
working.

Maybe there's a better way to repro this?
Okay, you got me -- I can't reproduce it now. But it *did* fail for me in a
nasty way today; Chimera crashed and couldn't be restarted immediately after
this (I didn't reboot or do other things to roll the pids over). Well, blame it
on the sun spots... 
Re-reading this bug, I don't see why it should stay open.  Comment #4 discloses
info from pink (via IRC) suggesting that the summary here (which dictates a
solution without stating the problem being solved) is inaccurate.  If the
solution is "cope with pid recycling", then this bug is a dup of bug 151188. 
I'll leave it to ccarlen to mark it as such, unless there's new evidence or a
better statement of the (different from pid recycling) problem to be solved.

/be
The only problem with the current locking scheme which is proven to exist is bug
151188. Using flock() is not reasonable because of its limitations with some NFS
servers.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
Verified wontfix
Status: RESOLVED → VERIFIED
I don't believe that PID recycling is the only issue here. I think the lock file
is not cleaned up for certain kinds of application crashes.
If anyone has evidence of fatal signal handling bugs in the profile locking
code, please file them.

Of course, SIGKILL is uncatchable, so the pid lock can remain after at least
some kinds of program terminations.  So what?  The profile locking code, once we
fix bug 151188, is robust in the face of a left-over lock file.

/be
I don't think PID recurrance after reboot is the only problem we face. The other
problem is this:

1) The machine crashes and is rebooted.
2) The machine is set up to get its IP address from DHCP and after the reboot,
it gets a different IP.
3) Next time around, the lock-checking code will think that lock is by a
non-local process and be forced to give up.

Unlike PID recurrance, I don't see a solution to this. For XP_MACOSX, I'm
thinking that the benefits of the fcntl method begin to outweigh its not working
on some NFS servers.
One solution would be to use an address that is not likely to change on reboot
or as frequently as a DHCP lease. Options include:

 - The link-local IPv6 address of the first interface (Jag only)
 - md5 hash of the computer name

The changing IP issue is solvable for non-system-crash situations. Chimera
should register with the Kernel Event Monitor and the Preferences Monitor to
receive notifications of changing network conditions:

 - The System proxy settings should be reloaded when a change occurs
   (worth a separate bug, perhaps).
 - On receiving a kernel notification about an interface change, Chimera
   would still know both the old and the new address, enabling it to find
   the old pid file and to rewrite it with the new info.
Okay, I can now replicate my problem with the profile locking. I use my
PowerBook both at work and at home, using the location manager to switch between
configurations. Each location has an unique IP address, but since I put my
PowerBook to sleep between the places instead of shutting it down, a single
Chimera instance can survive the trip.

If I start up Chimera at home and then crash it at work, the stale lock file
contains my home IP address and goes unrecognized. Chimera is forced to give up.

The browser needs to listen to interface changes and rewrite the lock file
accordingly.
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
This was done (using fcntl) by bug 176608 for Chimera and Mach-0 mozilla so it's
either fixed or invalid there. The remaining problem, for other platforms, is a
dup of bug 151188. Closing as a dup so a link to this bug and its patch ;-) will
show up there.

*** This bug has been marked as a duplicate of 151188 ***
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → DUPLICATE
Verified as a dupe of bug 151188
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: