Profiler livelock caused by SA_RESTART restarting fork() syscall

VERIFIED FIXED in Firefox 23

Status

()

Core
Gecko Profiler
--
critical
VERIFIED FIXED
5 years ago
5 years ago

People

(Reporter: gerv, Assigned: jseward)

Tracking

({hang})

Trunk
mozilla23
All
Linux
Points:
---

Firefox Tracking Flags

(firefox20 affected, firefox21- affected, firefox22- affected, firefox23- verified, firefox-esr17 unaffected)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

5 years ago
When enabled, the profiler causes serious performance regression when opening files from Download Manager. It can take up to 10 minutes (!) for a file to open in the external application. Turning off the profiler returns the time to a fraction of a second.

Mike asked me to include the following:

(gdb) bt
#0  0xb57726e2 in ProfilerSignalHandler(int, siginfo*, void*) ()
   from /home/gerv/apps/firefox/2012-06-25/libxul.so
#1  <signal handler called>
#2  0xb77da422 in __kernel_vsyscall ()
#3  0xb7577f7c in fork () from /lib/i386-linux-gnu/libc.so.6
#4  0xb77af464 in fork () from /lib/i386-linux-gnu/libpthread.so.0
#5  0xb4c57555 in ?? () from /lib/i386-linux-gnu/libglib-2.0.so.0
#6  0xb4c58358 in g_spawn_async_with_pipes ()
   from /lib/i386-linux-gnu/libglib-2.0.so.0
#7  0xb4c5840f in g_spawn_async () from /lib/i386-linux-gnu/libglib-2.0.so.0
#8  0xb43175c8 in ?? () from /usr/lib/i386-linux-gnu/libgio-2.0.so.0
#9  0xb4317d27 in ?? () from /usr/lib/i386-linux-gnu/libgio-2.0.so.0
#10 0xb42b268a in g_app_info_launch_uris ()
   from /usr/lib/i386-linux-gnu/libgio-2.0.so.0
#11 0xb42b27b8 in g_app_info_launch_default_for_uri ()
   from /usr/lib/i386-linux-gnu/libgio-2.0.so.0
#12 0xb2c80bf2 in nsGIOService::ShowURIForInput(nsACString const&) ()
   from /home/gerv/apps/firefox/2012-06-25/components/libmozgnome.so
#13 0xb59dec6d in nsLocalFile::Launch() ()
   from /home/gerv/apps/firefox/2012-06-25/libxul.so
#14 0xb666ead0 in NS_InvokeByIndex_P ()
   from /home/gerv/apps/firefox/2012-06-25/libxul.so
#15 0xb6479e94 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) () from /home/gerv/apps/firefox/2012-06-25/libxul.so
#16 0xb647ec9b in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) ()
   from /home/gerv/apps/firefox/2012-06-25/libxul.so
#17 0xb683be33 in js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /home/gerv/apps/firefox/2012-06-25/libxul.so
#18 0xb683022e in js::Interpret(JSContext*, js::StackFrame*, js::InterpMode) ()
   from /home/gerv/apps/firefox/2012-06-25/libxul.so
#19 0xb683b966 in js::RunScript(JSContext*, js::StackFrame*) ()
   from /home/gerv/apps/firefox/2012-06-25/libxul.so
#20 0xb683c2a1 in js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /home/gerv/apps/firefox/2012-06-25/libxul.so
#21 0xb680bd12 in js::CallOrConstructBoundFunction(JSContext*, unsigned int, JS::Value*) () from /home/gerv/apps/firefox/2012-06-25/libxul.so
#22 0xb683be33 in js::InvokeKernel(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /home/gerv/apps/firefox/2012-06-25/libxul.so
---Type <return> to continue, or q <return> to quit---

Gerv

Comment 1

5 years ago
We're using SA_RESTART, which means that if we send the signal during the fork syscall, the syscall gets restarted, probably only to be interrupted by the signal handler again, and the same happens over and over.  Not sure how we should fix this though...
Maybe we should inhibit the profiler with pthread_atfork.
Please see bug 757186 comment 22. I would hugely appreciate if somebody fixed this.

Updated

5 years ago
Duplicate of this bug: 859071
As Gerv said this can freeze the browser for a really long time. Examples can be seen in my now duped bug 859071.
Severity: normal → critical
Keywords: hang
status-firefox20: --- → affected
status-firefox21: --- → affected
status-firefox22: --- → affected
status-firefox23: --- → affected
status-firefox-esr17: --- → unaffected
tracking-firefox21: --- → ?
tracking-firefox22: --- → ?
tracking-firefox23: --- → ?

Updated

5 years ago
Duplicate of this bug: 737328

Updated

5 years ago
Summary: Profiler causes serious performance regression when opening files from Download Manager → Profiler livelock caused by SA_RESTART restarting fork() syscall
(Assignee)

Comment 7

5 years ago
Created attachment 735682 [details] [diff] [review]
A possible fix

Comments 1 and 2 seem plausible to me, and also appear to match the
symptoms I see.  This patch _appears_ to fix the problem for me, but
needs testing by other folks.  It is pretty straightforward, but
completely ignores the issue of thread-safety, so should really be
redone using CAS or some such.

It has the side effect of permanently disabling the signal delivery in
the child.  I don't know if that is a good or bad thing -- it could
easily be avoided by having a paf_child routine to set in_fork to
false too.
Comment on attachment 735682 [details] [diff] [review]
A possible fix

Review of attachment 735682 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/platform-linux.cc
@@ +85,5 @@
> +// great deal.  A side effect of this is to permanently disable
> +// sampling in the child process.  See bug 837390.
> +static bool in_fork = false; // Raced on
> +static void paf_prepare(void) { in_fork = true; }
> +static void paf_parent(void)  { in_fork = false; }

Maybe we want to use SetPaused instead. It uses a NoBarrier_Store which may be ok. This means that we may still interrupt the fork but only up to a limited number of times (once?). And if it doesn't do what we want then we should just fix Pause.
Comment on attachment 735682 [details] [diff] [review]
A possible fix

Review of attachment 735682 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/platform-linux.cc
@@ +216,5 @@
> +  static bool atfork_set = false; // Raced on
> +  if (!atfork_set) {
> +    pthread_atfork(paf_prepare, paf_parent, NULL);
> +    atfork_set = true;
> +  }

You could work around the race here with something like:

static void* setup_atfork()
{
  pthread_atfork(paf_prepare, paf_parent, NULL);
  return NULL;
}

static void* SenderEntry(void* arg)
{
  static void* initialize_atfork = setup_atfork();
  ...
  return initialize_atfork;
}

And let the compiler do the hard work of atomic variable setting and whatnot.
This is not a user facing issue, so no need to track for release. We'll take low risk uplifts in support of developer workflow though.
tracking-firefox21: ? → -
tracking-firefox22: ? → -
tracking-firefox23: ? → -
(In reply to Alex Keybl [:akeybl] from comment #10)
> This is not a user facing issue.

Actually, it is, in all versions where the developer tools have the profiler. Which I think spans from beta to central.
(In reply to Mike Hommey [:glandium] from comment #11)
> (In reply to Alex Keybl [:akeybl] from comment #10)
> > This is not a user facing issue.
> 
> Actually, it is, in all versions where the developer tools have the
> profiler. Which I think spans from beta to central.

We decided to disable the developer tools profiler on release until we're ready to ship it. Until then it's behind a single pref.
(Assignee)

Comment 13

5 years ago
Created attachment 737241 [details] [diff] [review]
Revised patch

Revised patch, that uses IsPaused/SetPaused instead, and also
uses Nathan's suggestion for doing the pthread_atfork call just once.
Attachment #735682 - Attachment is obsolete: true
Attachment #737241 - Flags: review?(bgirard)
Comment on attachment 737241 [details] [diff] [review]
Revised patch

Review of attachment 737241 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

This patch has the potential to fail if during the fork the profiler pause state changes due to some other race with another thread. This could only happen if the fork doesn't happen on the main thread. I think I'm ok ruling this as unlikely.

Did this fix the error locally?

::: tools/profiler/platform-linux.cc
@@ +242,4 @@
>    Sampler::PlatformData* data =
>        reinterpret_cast<Sampler::PlatformData*>(arg);
>    data->SignalSender();
> +  return initialize_atfork; // which is guaranteed to be NULL

Maybe this should just return null then? SignalSender will run for the duration of the thread so it's awfully late to propagate a return value.
Attachment #737241 - Flags: review?(bgirard) → review+
(In reply to Benoit Girard (:BenWa) from comment #14)
> > +  return initialize_atfork; // which is guaranteed to be NULL
> 
> Maybe this should just return null then? SignalSender will run for the
> duration of the thread so it's awfully late to propagate a return value.

Returning it helps avoid unused variable warnings.
Comment on attachment 737241 [details] [diff] [review]
Revised patch

Review of attachment 737241 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/platform-linux.cc
@@ +106,5 @@
> +    sActiveSampler->SetPaused(was_paused);
> +}
> +
> +// Set up the fork handlers.  This is called just once, at module
> +// load time.

It's worth noting that the way you have this written, this function *will not* execute at module load time (i.e. as a static constructor) but the first time SenderEntry is called.  You may want to shuffle the code around if you actually want it called at module load time.
You're right. Static has a nasty bug of not being thread safe if two threads enter this function at the first time before the initializer runs (at least with msvc). In this case only one thread will ever call this so we're safe. The comment should be fixed.
(Assignee)

Comment 18

5 years ago
(In reply to Benoit Girard (:BenWa) from comment #14)
> This patch has the potential to fail if during the fork the profiler pause
> state changes due to some other race with another thread.

Well, the original version of the patch didn't have this problem,
since it used a new, global boolean as an extra gating condition for
sending the signals, and didn't mess with the paused state of the
sampler.  Which solution do you prefer?
(Assignee)

Comment 19

5 years ago
(In reply to Benoit Girard (:BenWa) from comment #14)
> Did this fix the error locally?

Sorry, missed this before.  Yes it does -- I'm fairly sure.  Without the patch
I can reproduce the problem pretty reliably on my maptop.  With it, I could not
get any problem in four consecutive startups.
(Assignee)

Comment 20

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #16)
> It's worth noting that the way you have this written, this function *will
> not* execute at module load time (i.e. as a static constructor) but the
> first time SenderEntry is called.  You may want to shuffle the code around
> if you actually want it called at module load time.

I can fix the comment, and I think we're OK if the pthread_atfork
call only gets done at the first time SenderEntry is called.  But I'd
prefer -- from a safety point of view -- to do it earlier, at module
load time.  I am unclear what you mean by "shuffle code around"
though.  Do you have some specific transformation in mind?
(In reply to Julian Seward from comment #20)
> (In reply to Nathan Froyd (:froydnj) from comment #16)
> > It's worth noting that the way you have this written, this function *will
> > not* execute at module load time (i.e. as a static constructor) but the
> > first time SenderEntry is called.  You may want to shuffle the code around
> > if you actually want it called at module load time.
> 
> I can fix the comment, and I think we're OK if the pthread_atfork
> call only gets done at the first time SenderEntry is called.  But I'd
> prefer -- from a safety point of view -- to do it earlier, at module
> load time.  I am unclear what you mean by "shuffle code around"
> though.  Do you have some specific transformation in mind?

You just move the call to setup_atfork to the top level:

static void* dummy = setup_atfork();

static void* SenderEntry(void* arg)
{
  Sampler::PlatformData* data = ...
  ...
}

Boo for more static constructors, though. :)
(In reply to Nathan Froyd (:froydnj) from comment #21)
> (In reply to Julian Seward from comment #20)
> > (In reply to Nathan Froyd (:froydnj) from comment #16)
> > > It's worth noting that the way you have this written, this function *will
> > > not* execute at module load time (i.e. as a static constructor) but the
> > > first time SenderEntry is called.  You may want to shuffle the code around
> > > if you actually want it called at module load time.
> > 
> > I can fix the comment, and I think we're OK if the pthread_atfork
> > call only gets done at the first time SenderEntry is called.  But I'd
> > prefer -- from a safety point of view -- to do it earlier, at module
> > load time.  I am unclear what you mean by "shuffle code around"
> > though.  Do you have some specific transformation in mind?
> 
> You just move the call to setup_atfork to the top level:
> 
> static void* dummy = setup_atfork();
> 
> static void* SenderEntry(void* arg)
> {
>   Sampler::PlatformData* data = ...
>   ...
> }
> 
> Boo for more static constructors, though. :)

I don't see the benefit of doing this using a static constructor. The profiler is used by a subset of users on a subset of their time so I'd really like to avoid this. Ideally we only need to setup a TLS+off state+PseudoStack when the profiler is off. Why not set this up the first time sampler_start is called?
(Assignee)

Comment 23

5 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/e2c25a98cebb
https://hg.mozilla.org/mozilla-central/rev/e2c25a98cebb
Assignee: nobody → jseward
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
status-firefox23: affected → fixed
Works great with Mozilla/5.0 (X11; Linux x86_64; rv:23.0) Gecko/20130421 Firefox/23.0 ID:20130421031002 CSet: 0d50cb959c46

Sad to see that we don't want to backport the patch. I'm using the profiler a lot in my daily profile with an Aurora build. I will have to keep it disabled until Aurora get the bump to 23.0.
Status: RESOLVED → VERIFIED
(In reply to Henrik Skupin (:whimboo) from comment #25)
> Sad to see that we don't want to backport the patch. I'm using the profiler
> a lot in my daily profile with an Aurora build. I will have to keep it
> disabled until Aurora get the bump to 23.0.

That's in case not true. I misremembered what has been said in comment 10. So I would appreciate if we can get this backported.
Marking status-firefox23:verified based on comment based on comment 25.
status-firefox23: fixed → verified
You need to log in before you can comment on or make changes to this bug.