921063 - Simultaneous launches always produce "Firefox/Thunderbird is already running but is not responding" error (Linux)

Matthew T (active 1999-2002)

Reporter

Description

•

11 years ago

Firefox 24.0, Ubuntu 13.04
Thunderbird 24.0, Ubuntu 13.04
(and I vaguely recall this bug occurring as far back as Firebird 0.7 on Windows 98)

When Firefox or Thunderbird is early in the launch process, trying to launch it again always produces an error of the form: "{App Name} is already running, but is not responding. To open a new window, you must first close the existing {App Name} process, or restart your system."

Unlike some causes of this error, this is not a hang. The reason the app is not responding is merely that it hasn't finished launching yet.

Example steps to reproduce in Firefox:
1. Create two or more bookmarks in a file folder -- for example, by dragging these two links http://example.com/foo http://example.com/bar into a file manager window.
2. Close Firefox completely. If necessary, verify that it has closed using Task Manager, Activity Monitor, ps, or similar.
3. Select both bookmarks in the file folder, and choose "Open".
What happens: One bookmark opens in Firefox, but the other results in a "Firefox is not responding" error.

Example steps to reproduce in Thunderbird:
1. Create two or more e-mail links in a file folder -- for example, by dragging these two links example1@example.com example2Example.com into a file manager window.
2. Close Thunderbird completely. If necessary, verify that it has closed using Task Manager, Activity Monitor, ps, or similar.
3. Select both e-mail links in the file folder, and choose "Open".
What happens: A composition window opens for one address, but the other results in a "Thunderbird is not responding" error.
What should happen: One composition window opens with both addresses as recipients (matching Address Book behavior).

These examples are contrived to demonstrate the problem even on fast PCs. But the slower a PC is, the more often you will see the problem from launches that are not exactly simultaneous -- for example, from Firefox when you rapidly click two links in a mailer or Twitter client.

This bug report is to launching as bug 759206 is to quitting, and the solutions to both bugs *might* share some interface design and/or code.

Matthew T (active 1999-2002)

Reporter

Comment 1

•

11 years ago

(An unmangled second e-mail link for your dragging convenience: example2@example.com .)

Benjamin Smedberg

Comment 2

•

11 years ago

I believe that this is specific to Linux. On Windows, the process of starting remoting uses a systemwide mutex:

http://hg.mozilla.org/mozilla-central/annotate/64b497e6f593/toolkit/xre/nsNativeAppSupportWin.cpp#l98

This named mutex means that the process of creating the server window for remoting is always an atomic operation.

On Linux I am not aware of an equivalent cross-process mutex (corrections/suggestions welcome!) for an X session. So the startup sequence looks like this:

* look for a remote instance - http://hg.mozilla.org/mozilla-central/annotate/64b497e6f593/toolkit/xre/nsAppRunner.cpp#l3351
* If a remote instance is found, send the command line to it and quit
* If not, start the profile normally and use this instance
* Which, if we're racing two processes, hits this line, can't lock the profile because the other instance has won, and shows the dialog - http://hg.mozilla.org/mozilla-central/annotate/64b497e6f593/toolkit/xre/nsAppRunner.cpp#l3517

So if there is a cross-process named mutex for X/Linux, we could use that and avoid the race. Otherwise we'll have to do something like bug 921046 where we do a small loop of remoting/profile startup until one of them succeeds. I do not have time to fix this, but I will happily mentor somebody who wants to take it.

Severity: normal → enhancement

Keywords: helpwanted

Priority: -- → P3

Summary: Simultaneous launches always produce "Firefox/Thunderbird is already running but is not responding" error → Simultaneous launches always produce "Firefox/Thunderbird is already running but is not responding" error (Linux)

Whiteboard: [mentor=benjamin@smedbergs.us][lang=c++]

handa.vikalp

Comment 3

•

11 years ago

Hi Benjamin..I am a C++ developer and a beginner to Mozilla Community..I would like to start my journey by solving this bug..kindly guide me through this.

Loic

Comment 5

•

11 years ago

Is it a regression? Why is it reported only now?

Luca Niccoli

Comment 6

•

10 years ago

Attached file A sample systemwide mutex for linux (obsolete) — Details

Hi,
I'd like to work on this bug.
There are ways to have system wide mutexes on Linux, here attached is a basic one using a named semaphore.
I see some problems  with wrapping the critical section of the code in a mutex though:
- If the application crashes or is terminated while holding the mutex, it will be impossible to start a new instance.
- The point in which an existing instance of firefox is looked up and the one in which the new instance starts listening for new window requests are very far apart, which makes it somewhat complicated to ensure that the mutex is released in every code path.

The first problem could be mitigated by ignoring the fact that the mutex cannot be locked after a timeout (if an other instance is actually running, the startup will fail as usual when trying to lock the profile; if the profile can be locked, than no other instance is running and the semaphore can be reset to 1)
The second problem could be mitigated by moving remote lookup as close as possible to the remote service startup, but this does not seem so simple, as the remote service is started here in XREMain::XRE_mainRun while the remote lookup and the profile locking are both executed in XREMain::XRE_mainStartup.

Benjamin, are you still available to mentor this bug?
Do you think the mutex way is feasible?

Flags: needinfo?(benjamin)

Benjamin Smedberg

Comment 7

•

10 years ago

I'm not the Linux system programming expert, but the fact that a semaphore remains locked even when the process holding it crashes or exits makes it not very useful. Isn't there something like a window mutex which is held by a process but automatically released on crash/termination?

I'm not so worried about the case of normal startup and that mainRun and mainStartup are bit far apart: we can use a global within the startup system to guarantee that the mutex gets unlocked.

Flags: needinfo?(benjamin)

Luca Niccoli

Comment 8

•

10 years ago

Attached file File based system wide mutex (obsolete) — Details

We could use a mutex based on a file lock, like the one I just attached.
The downside is that there is no timeout in flock, so one has to spin (like I did in the example) or set up an alarm and a signal handler to ensure that locking fails after a certain amount of time. Also this kind of mutex does not work between threads in the same process, but this should not be a problem for this specific purpose.

The file backing this lock could be in the user profile directory.

Benjamin Smedberg

Comment 9

•

10 years ago

The profile lock is already a file lock and it's atomic. Since the mutexes don't seem to be a great option here, I think it would be better to focus on a looping solution like bug 921046:

* try remoting
* if no remoting, try locking and starting the profile
* if profile is locked, go back and try remoting
* retry profile locking
* only now report that the profile is locked

Luca Niccoli

Comment 10

•

10 years ago

But the profile lock, if I'm not mistaken, is held for as long as firefox runs, so it does not make sense to wait on it before trying remoting: if the new instance can acquire it, then the old instance has exited and there is no possibility for remoting.

The new start up lock would instead be acquired just before looking for a remote and released as soon as the remoting service is started; at that point a second instance of firefox would be allowed to acquire it and to look for existing remotes.

I think the looping solution has a problem on its own: the second instance of firefox could be able to go through the whole loop before the first instance has been able to start the remoting service (which is done quite later than locking the profile).

Luca Niccoli

Comment 11

•

10 years ago

Attached patch startup.patch (obsolete) — Details — Splinter Review

Hi,

I gave a shot at the file-based mutex, this is the result.
It seems to be working correctly; I'd be glad if you could look at the patch and give me some feedback

Attachment #8379292 - Flags: review?(benjamin)

Benjamin Smedberg

Comment 12

•

10 years ago

Comment on attachment 8379292 [details] [diff] [review]
startup.patch

Why is the mutex file in /tmp instead of in the profile directory itself?

Why is it a fatal error if we fail to lock or unlock the mutex?

We need to measure at least using telemetry the amount of time we spending waiting for this lock.

There's a bunch of whitespace/style stuff that needs to be fixed: see https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style

I feel like the XRemoteMutex class ought to go somewhere else: maybe in XRemoteClient.h or a new header? nsAppRunner.cpp is already way too big for its own good.

flock isn't always reliable. http://mxr.mozilla.org/mozilla-central/source/profile/dirserviceprovider/src/nsProfileLock.cpp#532 uses flock and then falls back to symlinks. Is it possible to refactor that code to use it for both this lock and the profile lock?

Attachment #8379292 - Flags: review?(benjamin)

Luca Niccoli

Comment 13

•

10 years ago

Hi Benjamin, some answers to your comments below:

The mutex file is in /tmp because when the startup lock is acquired the profile has not been chosen yet (and it's not clear to me if the XRemote locking/lookup can be postponed after the profile choice).

It would probably sensible to go on with a warning if the lock cannot be acquired: we just revert to the previous behaviour.

An uncontended file lock on a local file system (as is the case with /tmp) should take less than a millisecond; in the contended case the second instance has obviously to wait until the first has started up, but there's no way around it. I will look into Telemetry; I guess also adding a boolean probe for the case where we cannot acquire the startup lock would be sensible.

flock is reliable on a local file system (as one can assume /tmp is), it's with NFS that causes troubles; anyway I've given a look at nsProfileLock and I think it could be used in place of XRemoteMutex, so I will try again using it which would seem the best solution. This would involve creating a directory under /tmp/  and using nsProfileLock on it. I will also take care of the style issues.

Luca Niccoli

Updated

•

10 years ago

Depends on: 977786

Luca Niccoli

Updated

•

10 years ago

Depends on: 977803

Luca Niccoli

Comment 14

•

10 years ago

Attached patch Startup mutexing with nsProfileLock (obsolete) — Details — Splinter Review

This patch uses nsProfileLock for the startup mutex; I had to make its header exported since it was not.
It's still missing the telemetry stuff, I'll go ahead with that when the patch is otherwise sound.
It depends on https://bugzilla.mozilla.org/attachment.cgi?id=8383254 and https://bug977786.bugzilla.mozilla.org/attachment.cgi?id=8383246 for nsProfileLock to work correctly.

Attachment #8377488 - Attachment is obsolete: true

Attachment #8378173 - Attachment is obsolete: true

Attachment #8379292 - Attachment is obsolete: true

Attachment #8383310 - Flags: review?(benjamin)

Neil Deakin

Updated

•

10 years ago

Blocks: 982324

Marco Mucci [:MarcoM]

Updated

•

10 years ago

Flags: firefox-backlog?

Marco Mucci [:MarcoM]

Updated

•

10 years ago

Flags: firefox-backlog? → firefox-backlog+

Whiteboard: [mentor=benjamin@smedbergs.us][lang=c++] → [mentor=benjamin@smedbergs.us][lang=c++] p=0

Nobody; OK to take it and work on it

Assignee

Updated

•

10 years ago

Mentor: benjamin

Whiteboard: [mentor=benjamin@smedbergs.us][lang=c++] p=0 → [lang=c++] p=0

Marco Mucci [:MarcoM]

Updated

•

10 years ago

Whiteboard: [lang=c++] p=0 → [lang=c++]

Jim Mathies [:jimm]

Updated

•

8 years ago

Attachment #8383310 - Flags: review?(benjamin) → review?(gpascutto)

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/76964/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/76964/

Gian-Carlo Pascutto [:gcp]

Updated

•

8 years ago

Attachment #8383310 - Attachment is obsolete: true

Attachment #8383310 - Flags: review?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Comment 16

•

8 years ago

Rebased the patch, cleaned up the error handling and some minor nits. I've r? glandium to take a look at my changes: he's worked on -remote, knows Unix well, and has review turnaround times shorter than 2 years, generally.

Mike Hommey [:glandium]

Comment 17

•

8 years ago

mozreview-review

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review75268

::: toolkit/xre/nsAppRunner.cpp:3702
(Diff revision 1)
>    } else {
>      mDisableRemote = true;
>    }
>  #endif
>  #ifdef MOZ_ENABLE_XREMOTE
> +  nsAutoCString mutexPath = NS_LITERAL_CSTRING("/tmp/")

Start from:

nsCOMPtr<nsIFile> mutexDir;
rv = NS_GetSpecialDirectory(NS_OS_TEMP_DIR, getter_AddRefs(mutexDir));

instead of hardcoding /tmp.

::: toolkit/xre/nsAppRunner.cpp:3703
(Diff revision 1)
> +    + nsDependentCString(gAppData->name)
> +    + NS_LITERAL_CSTRING("_")
> +    + nsDependentCString(getenv("LOGNAME"));

The XRemote client part is kind of generic, and the RemoteCommandLine function, called later below actually takes command line arguments into account for the application and user names, so gAppData->name and LOGNAME are not the right unique identifier when such command line arguments are given.

::: toolkit/xre/nsAppRunner.cpp:3713
(Diff revision 1)
> +      rv = mutexDir->Create(nsIFile::DIRECTORY_TYPE, 0700);
> +      if (NS_SUCCEEDED(rv)) {

Since we're talking about race conditions, creating the mutexDir can fail here because the other racing process created it after this process checked it didn't exist but before the mutexDir->Create. Which a) makes the Exists check useless b) makes dirExists wrong in that case.

Just don't check for Exists, and check whether the Create succeeded or returned NS_ERROR_FILE_ALREADY_EXISTS.

::: toolkit/xre/nsAppRunner.cpp:3726
(Diff revision 1)
> +    int timeout = MOZ_XREMOTE_START_TIMEOUT_SEC;
> +    do {
> +      rv = mRemoteLock.Lock(mutexDir, nullptr);
> +      if (NS_SUCCEEDED(rv))
> +        break;
> +      sleep(1);

1 second is a lot of time to be waiting for for each cycle. And 30 seconds is a hell of a lot of time to be waiting for the lock, especially when chances are the XRemote client is going to wait on its own after that.

::: toolkit/xre/nsAppRunner.cpp:3735
(Diff revision 1)
> +    NS_WARNING("Cannot lock XRemote start mutex");
> +
>    // handle --remote now that xpcom is fired up
>    bool newInstance;
>    {
>      char *e = PR_GetEnv("MOZ_NO_REMOTE");

See this check here? You don't want to be using the "global mutex" in the case XRemote is disabled.

::: toolkit/xre/nsAppRunner.cpp:4322
(Diff revision 1)
>      // proxy window.
>      if (!mDisableRemote)
>        mRemoteService = do_GetService("@mozilla.org/toolkit/remote-service;1");
>      if (mRemoteService)
>        mRemoteService->Startup(mAppData->remotingName, mProfileName.get());
> +    mRemoteLock.Unlock();

The directory should be removed as well.

With all that being said, I'm not particularly convinced this should be fixed. Not because this is not a bug that needs fixing, but because it doesn't feel to me it's worth adding such complexity without first discussing about the future of XRemote. Because XRemote doesn't work with wayland, so it's about time we had a non-X11-bound remoting service for Unix. And chances are, a non-X11-bound remoting service wouldn't have the race condition this patch is trying to fix in the first place, because it would be based on unix sockets or whatever.

Attachment #8788462 - Flags: review?(mh+mozilla)

Gian-Carlo Pascutto [:gcp]

Comment 18

•

8 years ago

>I'm not particularly convinced this should be fixed.

I don't think "we may need to discuss a possible replacement for this and when (if) we ever implement it it may remove the need for this patch" is even close to an argument not to fix something.

If there were non-X11-remoting patches close to landing, maybe. Right now? Not so much.

Mike Hommey [:glandium]

Comment 19

•

8 years ago

As said on IRC, this bug has existed for 3 or 4 times as long as this bug has been open. I won't stop anyone from fixing it, but all things considered, I think time would be better spent addressing the more long-term issue.

Gian-Carlo Pascutto [:gcp]

Comment 20

•

8 years ago

(In reply to Mike Hommey [:glandium] from comment #17)
> Start from:
> 
> nsCOMPtr<nsIFile> mutexDir;
> rv = NS_GetSpecialDirectory(NS_OS_TEMP_DIR, getter_AddRefs(mutexDir));
> 
> instead of hardcoding /tmp.

This doesn't work because XPCOM and the nsComponentManager which provides NS_DIRECTORY_SERVICE aren't available at that point yet.

Comment hidden (mozreview-request)

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/76964/diff/1-2/

Mike Hommey [:glandium]

Comment 22

•

8 years ago

mozreview-review

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review77198

::: toolkit/xre/nsAppRunner.cpp:3725
(Diff revision 2)
>        e = PR_GetEnv("MOZ_NEW_INSTANCE");
>        newInstance = (e && *e);
>      }
>    }
>  
> +  if (!mDisableRemote) {

Come to think of it, I'm not entirely sure if it shouldn't be !newInstance.

::: toolkit/xre/nsAppRunner.cpp:3754
(Diff revision 2)
> +        mRemoteLockDir = mutexDir;
> +      }
> +    }
> +
> +    if (mRemoteLockDir) {
> +      PRIntervalTime epoch = PR_IntervalNow();

It would be better to use mozilla::TimeStamp here. I'd also go with sched_yield() from sched.h instead of sleeping.

Attachment #8788462 - Flags: review?(mh+mozilla)

Mike Hommey [:glandium]

Comment 23

•

8 years ago

mozreview-review

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review77200

Gian-Carlo Pascutto [:gcp]

Comment 24

•

8 years ago

(In reply to Mike Hommey [:glandium] from comment #22)
> Comment on attachment 8788462 [details]
> Bug 921063 - Lock and (potentially) wait for remote service startup.
> 
> https://reviewboard.mozilla.org/r/76964/#review77198
> 
> ::: toolkit/xre/nsAppRunner.cpp:3725
> (Diff revision 2)
> >        e = PR_GetEnv("MOZ_NEW_INSTANCE");
> >        newInstance = (e && *e);
> >      }
> >    }
> >  
> > +  if (!mDisableRemote) {
> 
> Come to think of it, I'm not entirely sure if it shouldn't be !newInstance.
> 

The flags are close, but the difference is that it's possible to have !mDisableRemote yet newInstance. In this case we do not want to try to get the remoting lock (as we don't care for it). So you're probably right.

Comment hidden (mozreview-request)

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/76964/diff/2-3/

Mike Hommey [:glandium]

Comment 26

•

8 years ago

mozreview-review

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

https://reviewboard.mozilla.org/r/76964/#review78344

::: toolkit/xre/nsAppRunner.cpp:3758
(Diff revisions 2 - 3)
> -      PRIntervalTime epoch = PR_IntervalNow();
> +      const TimeStamp epoch = mozilla::TimeStamp::Now();
>        do {
>          rv = mRemoteLock.Lock(mRemoteLockDir, nullptr);
>          if (NS_SUCCEEDED(rv))
>            break;
> -        PR_Sleep(PR_MillisecondsToInterval(200));
> +        sched_yield();

I'm surprised you don't need to add a #include <sched.h> somewhere. I'd still add one in case it works somehow on Linux but breaks BSDs.

Attachment #8788462 - Flags: review?(mh+mozilla) → review+

Comment hidden (mozreview-request)

Comment on attachment 8788462 [details]
Bug 921063 - Lock and (potentially) wait for remote service startup.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/76964/diff/3-4/

Pulsebot

Comment 28

•

8 years ago

Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a947ab8f930d
Lock and (potentially) wait for remote service startup. r=glandium

Carsten Book [:Tomcat]

Comment 29

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/a947ab8f930d

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox52: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

Julien Cristau [:jcristau]

Updated

•

7 years ago

Depends on: 1345413

Mike Hommey [:glandium]

Updated

•

7 years ago

Depends on: 1347358

A sample systemwide mutex for linux 10 years ago Luca Niccoli 1.28 KB, text/x-c++src		Details
File based system wide mutex 10 years ago Luca Niccoli 1.06 KB, text/x-c++src		Details
startup.patch 10 years ago Luca Niccoli 4.34 KB, patch		Details \| Diff \| Splinter Review
Startup mutexing with nsProfileLock 10 years ago Luca Niccoli 9.77 KB, patch		Details \| Diff \| Splinter Review
Bug 921063 - Lock and (potentially) wait for remote service startup. 8 years ago Gian-Carlo Pascutto [:gcp] 58 bytes, text/x-review-board-request	glandium : review+	Details