156836 - Use flock() for profile locking

Reporter

Description

•

22 years ago

Chimera (and perhaps mozilla?) create a lock file (or symlink) in the profile
directory to indicate that the profile is in use. The problem with this is that
the lock can remain in place if the program crashes, meaning that subsequent
attempts to run the program erroneously report that it's already running.

We should use flock() instead, since the kernel cleans these up when the process
dies.

Mike Pinkerton (not reading bugmail)

Comment 1

•

22 years ago

cc'ing brendan and bryner since i recall them talking about profile locking and
NFS volumes recently on irc.

Conrad Carlen (not reading bugmail)

Assignee

Comment 2

•

22 years ago

Yes. I suggested this recently. There is code in place that uses fcntl with
F_SETLK. It has the same advantages as flock. See
http://lxr.mozilla.org/mozilla/source/profile/src/nsProfileAccess.cpp#1959.

The symlink code was decided as the first course of action because of F_SETLK
not being available on all NFS servers. Read from
http://bugzilla.mozilla.org/show_bug.cgi?id=76431#c126 on down a few comments.
The symlink code that Brendan did sets signal handlers which should clear the
locks on a crash. How this is failing is strange indeed. Is this failing? Do you
have a way to repro it?

Still, the fcntl code is more failsafe and it's also faster. The symlink code
has to spend time looking up the IP address of the current process, etc. The
time to make this change (with a simple #ifdef XP_MACOSX to head straight to the
fcntl method) would have been before the preview was released. Locks made with
one method are not visible to the other.

If we're willing to accept that and change to use the fcntl locking scheme
before we send more versions out the door, let's do it - it's easy. We would
have to make the disclaimer that, if you're using a Chimera before this change
was made and a build with this change, you run the risk of profile corruption.

Status: NEW → ASSIGNED

Component: Profile Manager BackEnd → General

Product: Browser → Chimera

Target Milestone: --- → Chimera0.4

Version: other → unspecified

Brendan Eich [:brendan]

Comment 3

•

22 years ago

Broken or missing lockd is a function of the NFS server, not the client, so any
OSX client running with such a server is going to hate flock/fcntl-based profile
locking.  Why is OSX any different from other Unixes here?  Simon, please see
bug 76431, which points to the long history of the symlink-locking code.

Didn't bryner and shaver have a way to get the primary IP address without using
DNS?  Was that in some other bug, or this one?

/be

Brendan Eich [:brendan]

Comment 4

•

22 years ago

pink pointed out that this bug doesn't just want to use flock for the dozens of
milliseconds of speedup it might give, and for the malfunction fun with NFS home
directories it would give -- it seems to want to use flock to paper over some
problem with the symlink locking code not detecting a stale lock and removing
it.  Is that so?

As ccarlen points out, there is signal-handling code to remove a stale lock. 
But there is also code to invalidate a symlink lock if its pid identifies no
live process.  Can someone who sees a stuck symlink lock not being invalidated
on subsequent startup ls -l it, and then ps -p that pid, and say whether there
is a process with that id?

/be

Conrad Carlen (not reading bugmail)

Assignee

Comment 5

•

22 years ago

BTW, as long the symlink locking code has been in place, I have seen only 1 bug
about locks being left behind by crashes - and that was because, IIRC,  the
process ID was reused when the person rebooted. That's my fear about the symlink
method - if a local process with a certain process ID holds the lock and dies,
how are we absolutely sure that another process, next time around, doesn't get
that same process ID?

Brendan Eich [:brendan]

Comment 6

•

22 years ago

ccarlen: we can't be sure across a reboot, but we can be sure if only Mozilla or
Chimera crashes, and the OS stays up, that the pid likely won't be reused
(modern Unixes use 32 bit pits; even at thousands of forks per second, which is
far from the mean, you'll have millions of seconds till wraparound).

For the reboot case, we could use a system boot generation number, if there were
one.  Or in any case, if we found via kill(pid, 0) that the pid seems to
identify a live process, perhaps we could tell that it was not Mozilla.  If only
OSX had /proc support.

More in a bit.

/be

Brendan Eich [:brendan]

Comment 7

•

22 years ago

Attached patch untested fix for Unixes that have /proc/<pid>/maps a la Linux — Details — Splinter Review

Help me test this.  Tell me what to do for OSX (pink sent me the source to
top.c, but I could use <sys/sysctl.h> too).

/be

Brendan Eich [:brendan]

Comment 8

•

22 years ago

Maybe adding boot-time in a compatible way to the symlink signature is better. 
/proc/uptime on Linux, what on OSX?

/be

Marko Karppinen

Comment 9

•

22 years ago

I can confirm that the stale lock detection isn't working on Chimera/2002082905
and OS X 6C115. The pid in the symlink doesn't exist yet Chimera refuses to start.

Does anyone know if the NSDistributedLock class behaves correctly on an
application crash? If it does, it might be the way to go for Chimera.

Brendan Eich [:brendan]

Comment 10

•

22 years ago

I'm curious to learn where the pid-signature checking code fails on OSX --
anyone debug that far?

/be

Conrad Carlen (not reading bugmail)

Assignee

Comment 11

•

22 years ago

> The pid in the symlink doesn't exist yet Chimera refuses to start.

Are you sure about that?

I tried to repro this by killing the program in these three ways: (1) force-quit
(2) powering down machine when Chimera was running, and (3) inserting code which
would cause a crash.

In all 3 cases, the stale lock detection worked. The only way I could make it
fail was to do #3, restart, start enough processes so the next would be Chimera,
then start Chimera. Then, of course, it fails. That's the recurring pid problem
(bug 151188) The other cases which worked properly show that the pid check is
working.

Maybe there's a better way to repro this?

Marko Karppinen

Comment 12

•

22 years ago

Okay, you got me -- I can't reproduce it now. But it *did* fail for me in a
nasty way today; Chimera crashed and couldn't be restarted immediately after
this (I didn't reboot or do other things to roll the pids over). Well, blame it
on the sun spots...

Brendan Eich [:brendan]

Comment 13

•

22 years ago

Re-reading this bug, I don't see why it should stay open.  Comment #4 discloses
info from pink (via IRC) suggesting that the summary here (which dictates a
solution without stating the problem being solved) is inaccurate.  If the
solution is "cope with pid recycling", then this bug is a dup of bug 151188. 
I'll leave it to ccarlen to mark it as such, unless there's new evidence or a
better statement of the (different from pid recycling) problem to be solved.

/be

Conrad Carlen (not reading bugmail)

Assignee

Comment 14

•

22 years ago

The only problem with the current locking scheme which is proven to exist is bug
151188. Using flock() is not reasonable because of its limitations with some NFS
servers.

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → WONTFIX

K'Trina Medina

Comment 15

•

22 years ago

Verified wontfix

Status: RESOLVED → VERIFIED

Simon Fraser [no longer active]

Reporter

Comment 16

•

22 years ago

I don't believe that PID recycling is the only issue here. I think the lock file
is not cleaned up for certain kinds of application crashes.

Brendan Eich [:brendan]

Comment 17

•

22 years ago

If anyone has evidence of fatal signal handling bugs in the profile locking
code, please file them.

Of course, SIGKILL is uncatchable, so the pid lock can remain after at least
some kinds of program terminations.  So what?  The profile locking code, once we
fix bug 151188, is robust in the face of a left-over lock file.

/be

Conrad Carlen (not reading bugmail)

Assignee

Comment 18

•

22 years ago

I don't think PID recurrance after reboot is the only problem we face. The other
problem is this:

1) The machine crashes and is rebooted.
2) The machine is set up to get its IP address from DHCP and after the reboot,
it gets a different IP.
3) Next time around, the lock-checking code will think that lock is by a
non-local process and be forced to give up.

Unlike PID recurrance, I don't see a solution to this. For XP_MACOSX, I'm
thinking that the benefits of the fcntl method begin to outweigh its not working
on some NFS servers.

Marko Karppinen

Comment 19

•

22 years ago

One solution would be to use an address that is not likely to change on reboot
or as frequently as a DHCP lease. Options include:

 - The link-local IPv6 address of the first interface (Jag only)
 - md5 hash of the computer name

The changing IP issue is solvable for non-system-crash situations. Chimera
should register with the Kernel Event Monitor and the Preferences Monitor to
receive notifications of changing network conditions:

 - The System proxy settings should be reloaded when a change occurs
   (worth a separate bug, perhaps).
 - On receiving a kernel notification about an interface change, Chimera
   would still know both the old and the new address, enabling it to find
   the old pid file and to rewrite it with the new info.

Marko Karppinen

Comment 20

•

22 years ago

Okay, I can now replicate my problem with the profile locking. I use my
PowerBook both at work and at home, using the location manager to switch between
configurations. Each location has an unique IP address, but since I put my
PowerBook to sleep between the places instead of shutting it down, a single
Chimera instance can survive the trip.

If I start up Chimera at home and then crash it at work, the stale lock file
contains my home IP address and goes unrecognized. Chimera is forced to give up.

The browser needs to listen to interface changes and rewrite the lock file
accordingly.

Status: VERIFIED → REOPENED

Resolution: WONTFIX → ---

Conrad Carlen (not reading bugmail)

Assignee

Comment 21

•

22 years ago

This was done (using fcntl) by bug 176608 for Chimera and Mach-0 mozilla so it's
either fixed or invalid there. The remaining problem, for other platforms, is a
dup of bug 151188. Closing as a dup so a link to this bug and its patch ;-) will
show up there.

*** This bug has been marked as a duplicate of 151188 ***

Status: REOPENED → RESOLVED

Closed: 22 years ago → 22 years ago

Resolution: --- → DUPLICATE

K'Trina Medina

Comment 22

•

22 years ago

Verified as a dupe of bug 151188

Status: RESOLVED → VERIFIED