macOS and Linux need an update mutex
Categories
(Toolkit :: Application Update, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox134 | --- | fixed |
People
(Reporter: bytesized, Assigned: yannis)
References
(Regressed 1 open bug)
Details
(Whiteboard: [fidedi-ope])
Attachments
(3 files)
We have an update mutex that prevents multiple Firefox instances from updating simultaneously and interfering with one another. Interestingly, this mutex was actually added for a different reason and it was only incidental that it also fixed this issue. This probably explains why it was only ever implemented on Windows.
But this functionality is necessary across platforms. I was originally going to update Bug 1278252 to use for this, but I decided to create a new bug instead because that one is not quite right. It correctly identifies that multi-user updating is a problem, but fails to recognize that multi-profile updating causes all the same problems (and sometimes more).
To be a bit more clear, this is what the mutex is needed for: The files in the update directory represent the update state. We should never make changes to the update state from multiple Firefox instances at once. One instance gets to take the mutex successfully and all the others should refuse to update and allow the instance with the mutex to run the update process uninterrupted.
I recall that I looked into adding an update mutex for macOS once in the past and had some trouble finding a mechanism that fulfilled all the requirements. IIRC, the requirements were:
- Concurrency safety, of course. Exactly one instance should be able to hold it at once, regardless of the timing of when each instance attempts to take it.
- It should be file-based since what it is protecting is file-based. Ideally we would place it on the same filesystem as what we are protecting.
- If the Firefox instance holding the mutex crashes, the mutex must be released. Ideally it should be released immediately.
- Broad compatibility. We don't want this to fail just because someone is using a non-standard filesystem.
Someone raised the idea of using sqlite to do this. I never looked into how reasonable that was. Another option that was raised is nsProfileLock
. Reportedly there are some pretty rare issues with certain file systems, but I think this might still be our best option. I'm still looking into whether those issues are documented somewhere.
There is also a macOS specific issue here. macOS is the one platform that still uses per-user update directories. Migrating to a per-installation update directory is something that we want to do, but I know from experience that it's difficult and I'd rather not make it a prerequisite of this work. But if we put the macOS update mutex in the update directory (where it is on Windows), it won't actually solve Bug 1278252. I'm not aware of any directory that Firefox already uses that is installation specific. We might want to do something like /Library/Application Support/<bundle-id>/installHash/mutex
? I'm not totally sure.
But once we have answered the relevant questions, I believe the work should be pretty straightforward: Implement createMutex
for the remaining platforms, remove this check, and add some testing.
I'd like to quickly mention one related thing: nsIUpdateSyncManager
. This is also known as the Multi Instance Lock (MIL) and is very commonly confused with the update mutex. The MIL does exist on all platforms, but it does not work the same way or solve the same problem.
Confusingly, it does exist to address a very similar, very closely related problem that sometimes causes it to appear to fix this problem. When it detects other instances of Firefox running, it introduces very long delays into the update system. This mitigates Bug 1480452, but it also kind of makes it look a bit like this bug is fixed. But those delays are only temporary. Eventually Firefox will stop showing the "Another instances is updating" in the update UI, and allow manual or automatic updates to proceed, potentially causing this bug.
The MIL works using GNU file locking. Paraphrased from here:
A write lock gives a process exclusive access. While a write lock is held, no other process can lock the file at all.
A read lock prohibits any other process from requesting a write lock. However, other processes can request read locks.
This allows the MIL to do this:
- Very early in Firefox launch, take a read lock
- If we want to know if other instances are running, query the lock to find out if we could take a write lock (but don't actually take one).
- If we could take the lock, we are the only Firefox instance. If we could not, there are other instances running.
But this does not help determine which instance should drive update.
An obvious followup question is "why not use GNU file write locks for this?"
It's possible that it is reasonable to do so. We would need to look into how well that fulfills the requirements above. I'm not sure off the top of my head.
Updated•1 year ago
|
Comment 2•1 year ago
|
||
(In reply to Robin Steuber (they/them) [:bytesized] from comment #0)
Someone raised the idea of using sqlite to do this. I never looked into how reasonable that was. Another option that was raised is
nsProfileLock
. Reportedly there are some pretty rare issues with certain file systems, but I think this might still be our best option. I'm still looking into whether those issues are documented somewhere.
To elaborate on what nsProfileLock
does. Despite its name it allows taking an exclusive lock on any directory, it is not specific to the profile directory. It does so by creating a file in the directory and using one of a few strategies for taking an exclusive lock on it. It is used in early startup to only allow one Firefox instance to be in early startup at once (controlling the remoting service and profile selection such that only one instance can select a given profile). It is later used to lock the profile folder itself. It has been around for literally decades and so is battle tested.
On Windows it works by taking an OS level file lock by opening a file for exclusive writing. On linux and macOS it first attempts to use fcntl. If that fails because the filesystem doesn't support locking (networked filesystems generally) then it creates a symlink (which can be done so atomically). The symlink case is the only case that is not guaranteed to release the lock in the event of the process crashing. But there is still some protection, the target of the symlink is a special string that includes the pid of the process that takes the lock. When another process attempts to take the lock it verifies that a process with that pid exists, if it doesn't then it assumes the lock is stale and deletes it. This is not foolproof, we don't verify that the process is a Firefox process. There are probably improvements that could be made but there hasn't been enough of a need at this point as so few users use networked filesystems for their home directories. Possibly there are some filesystems that don't support fcntl or symlinks, locking would fail in this case.
Assignee | ||
Updated•9 months ago
|
Assignee | ||
Comment 3•9 months ago
•
|
||
Some initial remarks after discussing this issue and analyzing a bit how the current code works.
First, I'd like to correct that the current update mutex on Windows is not a file-based solution. It works by creating a uniquely named object (unique per installation, see getPerInstallationMutexName
) in the Windows kernel. This named object happens to be a mutex in the actual implementation (because we use CreateMutexW
) but it would work the same with any other named object (e.g. a named event, a named semaphore, a named pipe...) - we don't actually depend on the fact that the object is a mutex. Essentially, the ability to create the object because it does not exist is considered as we are now holding the update mutex, and the inability to create it because it already exists is considered as someone else is holding the update mutex.
So the current solution on Windows meets requirement 1 (If the mutex is a named mutex and the object existed before this function call, the return value is a handle to the existing object, and the GetLastError function returns ERROR_ALREADY_EXISTS
), requirement 3 (The system closes the handle automatically when the process terminates. The mutex object is destroyed when its last handle has been closed.) and requirement 4 (not file-based so works with all filesystems), but not it does not actually meet requirement 2 (not file-based).
I believe that we could replicate the same behavior on Linux and macOS by using e.g. sem_open
and also meet requirement 1 (If both O_CREAT
and O_EXCL
are specified in oflag, then an error is returned if a semaphore with the given name already exists.), requirement 3 (sem_close()
closes the named semaphore referred to by sem, allowing any resources that the system has allocated to the calling process for this semaphore to be freed., All open named semaphores are automatically closed on process termination, or upon execve
.) and requirement 4 (not file-based so works with all filesystems), but not requirement 2 (not file-based).
Could you explain in more detail what you think requirement 2 would bring to the table? Thank you.
Reporter | ||
Comment 4•9 months ago
|
||
It would be nice if it were file-based, since what we are protecting is file-based. That way if this directory is being shared in a way we don't predict, we still use the same lock file and we know that the scope of the lock is always correct. But it's not strictly required that it be file-based, only that it is shared amongst instances that share an update directory and not shared amongst instances that do not share an update directory.
The exact characteristics of the update directory vary a bit amongst OSes, so this makes it appealing to tie it to the update directory directly. On Windows, the update directory is per-installation. On Linux, I believe the update directory literally resides in the installation directory so it is also per-installation. On macOS however, I believe that the update directory is currently per-installation and per-user, but we may eventually need to change that to being strictly per-installation.
(In reply to Yannis Juglaret [:yannis] from comment #3)
I believe that we could replicate the same behavior on Linux and macOS by using e.g.
sem_open
Is there a reason that this is preferable to nsProfileLock
? It sounded like that met the requirements and is already pretty battle tested.
Assignee | ||
Comment 5•9 months ago
•
|
||
(In reply to Robin Steuber (she/her) [:bytesized] from comment #4)
Is there a reason that this is preferable to
nsProfileLock
? It sounded like that met the requirements and is already pretty battle tested.
Mostly that this looks like it could have been a more direct transposition of the Windows behavior, so the patch could have been more direct to write if that was acceptable. But now I understand better what requirement 2 would provide, so let's go with nsProfileLock
directly.
Do we want to move to nsProfileLock
on Windows as well? In that case -- do we need to care about transitioning? Is it possible to have an instance running the old update mutex code with CreateMutexW
while the new one would be using the new code with nsProfileLock
? It sounds like if we have to handle that, then nsProfileLock
can bring the extra layer of safety brought by requirement 2 (e.g. if symlinks are used) but we would still need to keep the old code around, is that correct?
Reporter | ||
Comment 6•9 months ago
|
||
(In reply to Yannis Juglaret [:yannis] from comment #5)
Do we want to move to
nsProfileLock
on Windows as well?
I would prefer that we do it this way, but I don't have an especially strong opinion. It just seems easier not to maintain an extra mechanism if we don't have to.
In that case -- do we need to care about transitioning? Is it possible to have an instance running the old update mutex code with
CreateMutexW
while the new one would be using the new code withnsProfileLock
?
It is likely technically possible for this to happen. I think, however, that the likelihood is so small that we shouldn't worry too much about it. If there is an instance of Firefox running an older version than what is on disk, it is quickly going to get a "Restart Required" page and need to be restarted to switch to the new version. And, during that time, update would most likely be locked out anyways because isOtherInstanceRunning
would return true, leading to a 6 hour update timeout. It is technically possible for a client to have an update conflict despite that, but it seems sufficiently unlikely to me and everything will basically be put back into a good state after the next restart regardless.
Comment 7•9 months ago
|
||
There is one edge case that is known to be a problem for nsProfileLock, this is on Linux where the directory being locked is on NFS (so fcntl doesn't work) and for some reason the DNS server is extremely slow at resolving the local hostname (no idea why). This causes locking to take a long time. This is tracked in bug 1901228. I have some ideas to fix it but there is some complexity. Folks seeing this are already seeing very slow startup so you might be able to consider this ignorable. Also depending on where you put the lock directory you may be able to assume that NFS is never involved.
Assignee | ||
Comment 8•9 months ago
|
||
This patch moves the update mutex related code to a standalone
UpdateMutex class that lives in UpdateUtils. This reduces code
duplication and avoids depending on the low-level details of the
implementation in multiple locations in the code base.
Assignee | ||
Comment 9•9 months ago
|
||
Updated•9 months ago
|
Updated•9 months ago
|
Comment 10•9 months ago
|
||
(In reply to Yannis Juglaret [:yannis] from comment #3)
I believe that we could replicate the same behavior on Linux and macOS by using e.g.
sem_open
and also meet requirement 1 (If bothO_CREAT
andO_EXCL
are specified in oflag, then an error is returned if a semaphore with the given name already exists.), requirement 3 (sem_close()
closes the named semaphore referred to by sem, allowing any resources that the system has allocated to the calling process for this semaphore to be freed., All open named semaphores are automatically closed on process termination, or uponexecve
.) and requirement 4 (not file-based so works with all filesystems), but not requirement 2 (not file-based).
It looks like it's not an issue anymore given the current patches, but for what it's worth, I don't think that would have worked for this use case. If I understand correctly, sem_open
may or may not be implemented with files (glibc's implementation is) but it has file-like semantics: it has to be explicitly sem_unlink
ed to free up the name, and sem_close
ing it is neither sufficient nor necessary for that. This is why fcntl
locking tends to be used — usually closing a file doesn't have globally visible side-effects, but that's one of the exceptions.
Updated•9 months ago
|
Updated•9 months ago
|
Assignee | ||
Comment 11•8 months ago
|
||
In order to properly test the update mutex, we need a test that
acquires it from a different process. This patch adapts the
canCheckForAndApplyUpdates.js test to do that. We achieve this by
refactoring and recycling the Subprocess helpers used in the
updateSyncManager.js test.
Depends on D222157
Updated•8 months ago
|
Updated•8 months ago
|
Updated•8 months ago
|
Updated•8 months ago
|
Assignee | ||
Updated•8 months ago
|
Comment 12•8 months ago
|
||
Assignee | ||
Comment 13•8 months ago
|
||
Requested partial landing based on code review -- only D222156 for now.
Comment 14•8 months ago
|
||
bugherder |
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Comment 15•7 months ago
|
||
Requested landing of the rest of the stack.
Comment 16•7 months ago
|
||
Comment 17•7 months ago
|
||
Assignee | ||
Comment 18•7 months ago
•
|
||
Thanks for the backout -- my mistake for adding the new component to all platforms when its implementation is conditionally defined. I found the proper place to add the component and will ask for landing again.
Comment 19•7 months ago
|
||
Comment 20•7 months ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/1d8c418deee9
https://hg.mozilla.org/mozilla-central/rev/9fc4d9e98325
Description
•