Closed Bug 921063 Opened 11 years ago Closed 8 years ago

Simultaneous launches always produce "Firefox/Thunderbird is already running but is not responding" error (Linux)

Categories

(Toolkit :: Startup and Profile System, enhancement, P3)

24 Branch
x86
Linux
enhancement

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: mpt, Unassigned, Mentored)

References

Details

(Keywords: helpwanted, Whiteboard: [lang=c++])

Attachments

(1 file, 4 obsolete files)

Firefox 24.0, Ubuntu 13.04
Thunderbird 24.0, Ubuntu 13.04
(and I vaguely recall this bug occurring as far back as Firebird 0.7 on Windows 98)


When Firefox or Thunderbird is early in the launch process, trying to launch it again always produces an error of the form: "{App Name} is already running, but is not responding. To open a new window, you must first close the existing {App Name} process, or restart your system."

Unlike some causes of this error, this is not a hang. The reason the app is not responding is merely that it hasn't finished launching yet.

Example steps to reproduce in Firefox:
1. Create two or more bookmarks in a file folder -- for example, by dragging these two links http://example.com/foo http://example.com/bar into a file manager window.
2. Close Firefox completely. If necessary, verify that it has closed using Task Manager, Activity Monitor, ps, or similar.
3. Select both bookmarks in the file folder, and choose "Open".
What happens: One bookmark opens in Firefox, but the other results in a "Firefox is not responding" error.

Example steps to reproduce in Thunderbird:
1. Create two or more e-mail links in a file folder -- for example, by dragging these two links example1@example.com example2Example.com into a file manager window.
2. Close Thunderbird completely. If necessary, verify that it has closed using Task Manager, Activity Monitor, ps, or similar.
3. Select both e-mail links in the file folder, and choose "Open".
What happens: A composition window opens for one address, but the other results in a "Thunderbird is not responding" error.
What should happen: One composition window opens with both addresses as recipients (matching Address Book behavior).

These examples are contrived to demonstrate the problem even on fast PCs. But the slower a PC is, the more often you will see the problem from launches that are not exactly simultaneous -- for example, from Firefox when you rapidly click two links in a mailer or Twitter client.

This bug report is to launching as bug 759206 is to quitting, and the solutions to both bugs *might* share some interface design and/or code.
(An unmangled second e-mail link for your dragging convenience: example2@example.com .)
I believe that this is specific to Linux. On Windows, the process of starting remoting uses a systemwide mutex:

http://hg.mozilla.org/mozilla-central/annotate/64b497e6f593/toolkit/xre/nsNativeAppSupportWin.cpp#l98

This named mutex means that the process of creating the server window for remoting is always an atomic operation.

On Linux I am not aware of an equivalent cross-process mutex (corrections/suggestions welcome!) for an X session. So the startup sequence looks like this:

* look for a remote instance - http://hg.mozilla.org/mozilla-central/annotate/64b497e6f593/toolkit/xre/nsAppRunner.cpp#l3351
* If a remote instance is found, send the command line to it and quit
* If not, start the profile normally and use this instance
* Which, if we're racing two processes, hits this line, can't lock the profile because the other instance has won, and shows the dialog - http://hg.mozilla.org/mozilla-central/annotate/64b497e6f593/toolkit/xre/nsAppRunner.cpp#l3517

So if there is a cross-process named mutex for X/Linux, we could use that and avoid the race. Otherwise we'll have to do something like bug 921046 where we do a small loop of remoting/profile startup until one of them succeeds. I do not have time to fix this, but I will happily mentor somebody who wants to take it.
Severity: normal → enhancement
Keywords: helpwanted
Priority: -- → P3
Summary: Simultaneous launches always produce "Firefox/Thunderbird is already running but is not responding" error → Simultaneous launches always produce "Firefox/Thunderbird is already running but is not responding" error (Linux)
Whiteboard: [mentor=benjamin@smedbergs.us][lang=c++]
Hi Benjamin..I am a C++ developer and a beginner to Mozilla Community..I would like to start my journey by solving this bug..kindly guide me through this.
Is it a regression? Why is it reported only now?
Attached file A sample systemwide mutex for linux (obsolete) —
Hi,
I'd like to work on this bug.
There are ways to have system wide mutexes on Linux, here attached is a basic one using a named semaphore.
I see some problems  with wrapping the critical section of the code in a mutex though:
- If the application crashes or is terminated while holding the mutex, it will be impossible to start a new instance.
- The point in which an existing instance of firefox is looked up and the one in which the new instance starts listening for new window requests are very far apart, which makes it somewhat complicated to ensure that the mutex is released in every code path.

The first problem could be mitigated by ignoring the fact that the mutex cannot be locked after a timeout (if an other instance is actually running, the startup will fail as usual when trying to lock the profile; if the profile can be locked, than no other instance is running and the semaphore can be reset to 1)
The second problem could be mitigated by moving remote lookup as close as possible to the remote service startup, but this does not seem so simple, as the remote service is started here in XREMain::XRE_mainRun while the remote lookup and the profile locking are both executed in XREMain::XRE_mainStartup.

Benjamin, are you still available to mentor this bug?
Do you think the mutex way is feasible?
Flags: needinfo?(benjamin)
I'm not the Linux system programming expert, but the fact that a semaphore remains locked even when the process holding it crashes or exits makes it not very useful. Isn't there something like a window mutex which is held by a process but automatically released on crash/termination?

I'm not so worried about the case of normal startup and that mainRun and mainStartup are bit far apart: we can use a global within the startup system to guarantee that the mutex gets unlocked.
Flags: needinfo?(benjamin)
Attached file File based system wide mutex (obsolete) —
We could use a mutex based on a file lock, like the one I just attached.
The downside is that there is no timeout in flock, so one has to spin (like I did in the example) or set up an alarm and a signal handler to ensure that locking fails after a certain amount of time. Also this kind of mutex does not work between threads in the same process, but this should not be a problem for this specific purpose.

The file backing this lock could be in the user profile directory.
The profile lock is already a file lock and it's atomic. Since the mutexes don't seem to be a great option here, I think it would be better to focus on a looping solution like bug 921046:

* try remoting
* if no remoting, try locking and starting the profile
* if profile is locked, go back and try remoting
* retry profile locking
* only now report that the profile is locked
But the profile lock, if I'm not mistaken, is held for as long as firefox runs, so it does not make sense to wait on it before trying remoting: if the new instance can acquire it, then the old instance has exited and there is no possibility for remoting.

The new start up lock would instead be acquired just before looking for a remote and released as soon as the remoting service is started; at that point a second instance of firefox would be allowed to acquire it and to look for existing remotes.

I think the looping solution has a problem on its own: the second instance of firefox could be able to go through the whole loop before the first instance has been able to start the remoting service (which is done quite later than locking the profile).
Attached patch startup.patch (obsolete) — Splinter Review
Hi,

I gave a shot at the file-based mutex, this is the result.
It seems to be working correctly; I'd be glad if you could look at the patch and give me some feedback
Attachment #8379292 - Flags: review?(benjamin)
Comment on attachment 8379292 [details] [diff] [review]
startup.patch

Why is the mutex file in /tmp instead of in the profile directory itself?

Why is it a fatal error if we fail to lock or unlock the mutex?

We need to measure at least using telemetry the amount of time we spending waiting for this lock.

There's a bunch of whitespace/style stuff that needs to be fixed: see https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style

I feel like the XRemoteMutex class ought to go somewhere else: maybe in XRemoteClient.h or a new header? nsAppRunner.cpp is already way too big for its own good.

flock isn't always reliable. http://mxr.mozilla.org/mozilla-central/source/profile/dirserviceprovider/src/nsProfileLock.cpp#532 uses flock and then falls back to symlinks. Is it possible to refactor that code to use it for both this lock and the profile lock?
Attachment #8379292 - Flags: review?(benjamin)
Hi Benjamin, some answers to your comments below:

The mutex file is in /tmp because when the startup lock is acquired the profile has not been chosen yet (and it's not clear to me if the XRemote locking/lookup can be postponed after the profile choice).

It would probably sensible to go on with a warning if the lock cannot be acquired: we just revert to the previous behaviour.

An uncontended file lock on a local file system (as is the case with /tmp) should take less than a millisecond; in the contended case the second instance has obviously to wait until the first has started up, but there's no way around it. I will look into Telemetry; I guess also adding a boolean probe for the case where we cannot acquire the startup lock would be sensible.

flock is reliable on a local file system (as one can assume /tmp is), it's with NFS that causes troubles; anyway I've given a look at nsProfileLock and I think it could be used in place of XRemoteMutex, so I will try again using it which would seem the best solution. This would involve creating a directory under /tmp/  and using nsProfileLock on it. I will also take care of the style issues.
Depends on: 977786
Depends on: 977803
This patch uses nsProfileLock for the startup mutex; I had to make its header exported since it was not.
It's still missing the telemetry stuff, I'll go ahead with that when the patch is otherwise sound.
It depends on https://bugzilla.mozilla.org/attachment.cgi?id=8383254 and https://bug977786.bugzilla.mozilla.org/attachment.cgi?id=8383246 for nsProfileLock to work correctly.
Attachment #8377488 - Attachment is obsolete: true
Attachment #8378173 - Attachment is obsolete: true
Attachment #8379292 - Attachment is obsolete: true
Attachment #8383310 - Flags: review?(benjamin)
Blocks: 982324
Flags: firefox-backlog?
Flags: firefox-backlog? → firefox-backlog+
Whiteboard: [mentor=benjamin@smedbergs.us][lang=c++] → [mentor=benjamin@smedbergs.us][lang=c++] p=0
Mentor: benjamin
Whiteboard: [mentor=benjamin@smedbergs.us][lang=c++] p=0 → [lang=c++] p=0
Whiteboard: [lang=c++] p=0 → [lang=c++]
Attachment #8383310 - Flags: review?(benjamin) → review?(gpascutto)
Attachment #8383310 - Attachment is obsolete: true
Attachment #8383310 - Flags: review?(gpascutto)
Rebased the patch, cleaned up the error handling and some minor nits. I've r? glandium to take a look at my changes: he's worked on -remote, knows Unix well, and has review turnaround times shorter than 2 years, generally.
Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review75268

::: toolkit/xre/nsAppRunner.cpp:3702
(Diff revision 1)
>    } else {
>      mDisableRemote = true;
>    }
>  #endif
>  #ifdef MOZ_ENABLE_XREMOTE
> +  nsAutoCString mutexPath = NS_LITERAL_CSTRING("/tmp/")

Start from:

nsCOMPtr<nsIFile> mutexDir;
rv = NS_GetSpecialDirectory(NS_OS_TEMP_DIR, getter_AddRefs(mutexDir));

instead of hardcoding /tmp.

::: toolkit/xre/nsAppRunner.cpp:3703
(Diff revision 1)
> +    + nsDependentCString(gAppData->name)
> +    + NS_LITERAL_CSTRING("_")
> +    + nsDependentCString(getenv("LOGNAME"));

The XRemote client part is kind of generic, and the RemoteCommandLine function, called later below actually takes command line arguments into account for the application and user names, so gAppData->name and LOGNAME are not the right unique identifier when such command line arguments are given.

::: toolkit/xre/nsAppRunner.cpp:3713
(Diff revision 1)
> +      rv = mutexDir->Create(nsIFile::DIRECTORY_TYPE, 0700);
> +      if (NS_SUCCEEDED(rv)) {

Since we're talking about race conditions, creating the mutexDir can fail here because the other racing process created it after this process checked it didn't exist but before the mutexDir->Create. Which a) makes the Exists check useless b) makes dirExists wrong in that case.

Just don't check for Exists, and check whether the Create succeeded or returned NS_ERROR_FILE_ALREADY_EXISTS.

::: toolkit/xre/nsAppRunner.cpp:3726
(Diff revision 1)
> +    int timeout = MOZ_XREMOTE_START_TIMEOUT_SEC;
> +    do {
> +      rv = mRemoteLock.Lock(mutexDir, nullptr);
> +      if (NS_SUCCEEDED(rv))
> +        break;
> +      sleep(1);

1 second is a lot of time to be waiting for for each cycle. And 30 seconds is a hell of a lot of time to be waiting for the lock, especially when chances are the XRemote client is going to wait on its own after that.

::: toolkit/xre/nsAppRunner.cpp:3735
(Diff revision 1)
> +    NS_WARNING("Cannot lock XRemote start mutex");
> +
>    // handle --remote now that xpcom is fired up
>    bool newInstance;
>    {
>      char *e = PR_GetEnv("MOZ_NO_REMOTE");

See this check here? You don't want to be using the "global mutex" in the case XRemote is disabled.

::: toolkit/xre/nsAppRunner.cpp:4322
(Diff revision 1)
>      // proxy window.
>      if (!mDisableRemote)
>        mRemoteService = do_GetService("@mozilla.org/toolkit/remote-service;1");
>      if (mRemoteService)
>        mRemoteService->Startup(mAppData->remotingName, mProfileName.get());
> +    mRemoteLock.Unlock();

The directory should be removed as well.

With all that being said, I'm not particularly convinced this should be fixed. Not because this is not a bug that needs fixing, but because it doesn't feel to me it's worth adding such complexity without first discussing about the future of XRemote. Because XRemote doesn't work with wayland, so it's about time we had a non-X11-bound remoting service for Unix. And chances are, a non-X11-bound remoting service wouldn't have the race condition this patch is trying to fix in the first place, because it would be based on unix sockets or whatever.
Attachment #8788462 - Flags: review?(mh+mozilla)
>I'm not particularly convinced this should be fixed.

I don't think "we may need to discuss a possible replacement for this and when (if) we ever implement it it may remove the need for this patch" is even close to an argument not to fix something.

If there were non-X11-remoting patches close to landing, maybe. Right now? Not so much.
As said on IRC, this bug has existed for 3 or 4 times as long as this bug has been open. I won't stop anyone from fixing it, but all things considered, I think time would be better spent addressing the more long-term issue.
(In reply to Mike Hommey [:glandium] from comment #17)
> Start from:
> 
> nsCOMPtr<nsIFile> mutexDir;
> rv = NS_GetSpecialDirectory(NS_OS_TEMP_DIR, getter_AddRefs(mutexDir));
> 
> instead of hardcoding /tmp.

This doesn't work because XPCOM and the nsComponentManager which provides NS_DIRECTORY_SERVICE aren't available at that point yet.
Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review77198

::: toolkit/xre/nsAppRunner.cpp:3725
(Diff revision 2)
>        e = PR_GetEnv("MOZ_NEW_INSTANCE");
>        newInstance = (e && *e);
>      }
>    }
>  
> +  if (!mDisableRemote) {

Come to think of it, I'm not entirely sure if it shouldn't be !newInstance.

::: toolkit/xre/nsAppRunner.cpp:3754
(Diff revision 2)
> +        mRemoteLockDir = mutexDir;
> +      }
> +    }
> +
> +    if (mRemoteLockDir) {
> +      PRIntervalTime epoch = PR_IntervalNow();

It would be better to use mozilla::TimeStamp here. I'd also go with sched_yield() from sched.h instead of sleeping.
Attachment #8788462 - Flags: review?(mh+mozilla)
Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review77200
(In reply to Mike Hommey [:glandium] from comment #22)
> Comment on attachment 8788462 [details]
> Bug 921063 - Lock and (potentially) wait for remote service startup.
> 
> https://reviewboard.mozilla.org/r/76964/#review77198
> 
> ::: toolkit/xre/nsAppRunner.cpp:3725
> (Diff revision 2)
> >        e = PR_GetEnv("MOZ_NEW_INSTANCE");
> >        newInstance = (e && *e);
> >      }
> >    }
> >  
> > +  if (!mDisableRemote) {
> 
> Come to think of it, I'm not entirely sure if it shouldn't be !newInstance.
> 

The flags are close, but the difference is that it's possible to have !mDisableRemote yet newInstance. In this case we do not want to try to get the remoting lock (as we don't care for it). So you're probably right.
Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review78344

::: toolkit/xre/nsAppRunner.cpp:3758
(Diff revisions 2 - 3)
> -      PRIntervalTime epoch = PR_IntervalNow();
> +      const TimeStamp epoch = mozilla::TimeStamp::Now();
>        do {
>          rv = mRemoteLock.Lock(mRemoteLockDir, nullptr);
>          if (NS_SUCCEEDED(rv))
>            break;
> -        PR_Sleep(PR_MillisecondsToInterval(200));
> +        sched_yield();

I'm surprised you don't need to add a #include <sched.h> somewhere. I'd still add one in case it works somehow on Linux but breaks BSDs.
Attachment #8788462 - Flags: review?(mh+mozilla) → review+
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a947ab8f930d
Lock and (potentially) wait for remote service startup. r=glandium
https://hg.mozilla.org/mozilla-central/rev/a947ab8f930d
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Depends on: 1345413
Depends on: 1347358
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: