Closed Bug 1342102 Opened 3 years ago Closed 3 years ago

Use the same threading structure in platform-linux-android.cpp as for the -macos and -win32 versions

Categories

(Core :: Gecko Profiler, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(2 files, 2 obsolete files)

The -macos and -win32 profiler cores have the following structure:

  N threads being sampled

  one sampler thread, which does this:

    forever {
       for each registered thread t {
          t.suspend
          t.getcontext
          t.Tick (copy potentially lots of stuff out of t's stack)
          t.resume
       }
    }

The -linux-android core has a different scheme:

    sampler thread:
  
    forever {
       for each registered thread t {
          send SIGPROF to t
          wait for t to indicate its SIGPROF handler has finished
            (a.k.a sem_wait(&gSignalHandlingDone))
       }

    SIGPROF handler (runs in each sampled thread) {
       thisthread.getcontext
       thisthread.Tick
       tell the sampler thread we're done
          (a.k.a. sem_post(&gSignalHandlingDone))
    }

The different -linux-android scheme has a number of disadvantages:

(1) Current work to rationalise/clean up the threading structure of
    the profiler is complicated by the need to reason about/verify
    two different schemes.

    In particular, the Tick call in the -win32 and -macos schemes will
    produce its output on the sampler thread.  In the -linux-android
    scheme that is produced on the sampled threads.

(2) It causes a lot of duplicated code in platform-*.cpp.  For example
    SamplerThread::Run() in the -win32 and -macos files are very
    similar.  Ideally all three could be merged into a single file with
    all the identical logic commoned up.

(3) Running lots of code -- the entire contents of Tick -- in a signal
    handler isn't considered good practice.

Doing this will require implementing the (suspend, getcontext, resume)
primitives for Linux.
Doing this would be relatively straightforward if it were possible to
send an extra machine word along with a signal to a thread.  With the
POSIX "realtime signals extension" (which we have in Linux), it is
almost possible to do this using sigqueue() and siginfo.si_value.  But
with sigqueue() we can't deliver signals to a specific thread, so that
scheme doesn't work.

So we have to fall back to using a global variable.  The existing
platform-linux-android.cpp uses a global variable gCurrentThreadInfo
anyway, so we don't lose any generality like this.

Plan is:

  // Used to communicate between sampler and samplee threads
  struct Magic {
     sem_t         acquire_sem;
     sem_t         release_sem;
     sigcontext_t  thread_ctx;
  }

  static Atomic<Magic*> gMagicPtr;

  // We still have a signal handler in the sampled threads, but now
  // it does almost nothing:
  static void
  SigprofHandler(int signal, siginfo_t* info, void* context)
  {
     // Stash the thread context so that sampler thread can use it
     memcpy(gMagicPtr->thread_ctx, context)

     // Tell the sampler thread that the context is valid, and
     // go to sleep
     sem_post(gMagicPtr->acquire_sem);

     // Wait for the sampler thread to un-freeze us
     sem_wait(gMagicPtr->release_sem);
  }

  // Sampler thread is now more or less isomorphic with the
  // -macos/-win32 versions:
  sampler_thread:
  forever {
     for t in registered_threads {
        // perform "suspend" and "getcontext" steps
        struct Magic magic;  // on Sampler thread's stack
        gMagicPtr = &magic;
        memset(gMagicPtr, 0, sizeof(*gMagicPtr))
        sem_init(&gMagicPtr->acquire_sem, 0);
        sem_init(&gMagicPtr->release_sem, 0);

        tgkill(SIGPROF, t)
        sem_wait(acquire_sem);
        
        // t is now frozen and gMagicPtr->thread_ctx is valid
        Tick

        // perform "release" step
        sem_post(&release_sem);
     }
  }
I really like the idea.
Nice! Does this completely eliminate the "we must be careful about how where we call malloc" problem?

BTW, I have two plans to make platform-linux-android.cpp more similar to platform-{win32,macos}.cpp.

- Factor out the common parts of the main "forever" loop so that we have one copy of it, which can go in platform.cpp.

- Implement a SamplerThread class and put a lot of the global state within it. E.g. your gMagicPtr could go within it. This avoids having to clear the global state when the profiler is stopped, because it will instead disappear.

But I'm happy to wait for you to do this first.
(In reply to Nicholas Nethercote [:njn] from comment #3)
> Does this completely eliminate the "we must be careful about how
> where we call malloc" problem?

Alas, no .. this has no bearing on that problem.  This merely 
moves the critical section into the sampler thread so that the
thread structure is identical on all targets.  It doesn't
change the potential-deadlock semantics.

> BTW, I have two plans to make platform-linux-android.cpp more similar to
> platform-{win32,macos}.cpp.

Interesting.  I'll chase this next week and see if it's actually 
feasible.  If yes then we can de-duplicate the platform-* files.
Attached file play.c
Proof-of-concept standalone test program.

There's one dubious thing about this.  In the signal handler we call
both sem_post and sem_wait.  sem_post is "async signal safe" (OK for
use in a signal handler) but sem_wait isn't.  sem_wait can be
interrupted by a signal delivery.  We can restart the wait after such
an interruption, should it occur.

I find myself wondering what happens if the kernel fails to deliver
any of the SIGPROFs to the target thread.  Then the sampling thread
will deadlock.  But that would also be the case for the existing
implementation, so this doesn't make it any worse.
Attached patch bug1342102-1.cset (obsolete) — Splinter Review
WIP patch.

The good news is that it works, more or less as planned.
The bad news is that I had to insert a spinloop, for reasons
written in comments in the patch.  It doesn't spin much though
and with 1KHz sampling on Linux, doesn't seem to take measurably
more CPU than the original version (per watching /bin/top).

* Removes global variables: gCurrentThreadInfo, gRssMemory,
  gUssMemory, gSignalHandlingDone

* Replaces then with a single pointer, gSigHandlerCoordinator,
  to a |struct SigHandlerCoordinator|, which is what is referred to
  as |struct Magic| in comment 1.

* SigprofHandler() now only contains synchronisation code.

* The critical section (copying of the suspended thread's state)
  is now done in the loop in SigprofSender().
Comment on attachment 8841554 [details] [diff] [review]
bug1342102-1.cset

Review of attachment 8841554 [details] [diff] [review]:
-----------------------------------------------------------------

The spin loop is unfortunate. I can think of two possibilities for removing it.

- Use a third semaphore instead. This seems straightforward.

- Move the sem_init() calls from the inner loop to the outer loop. Then if the sem_wait(resume) in the signal handler takes a long time you don't have to worry about the next inner loop iteration causing problems via sem_init(resume). But I'm not sure what happens if signal handler is still waiting when the next inner loop iteration fires another signal. Might be a bad idea.



 AIUI, it's needed because you call sem_init() within the inner loop. What about if you moved it to the outer loop? I think you wouldn't need it then. Then it might be possible for SignalHandler's sem_wait(resume) call to still be incomplete by the time the next inner loop's iteration calls sem_post(resume), in which case it'll be interrupted and instead of looping the Signal handler could just return. (I'm not sure how the errno handling would work in this scenario.)

::: tools/profiler/core/platform-linux-android.cpp
@@ +165,5 @@
>  #endif
>  }
>  
> +// A type used to coordinate between the sampler (signal sending)
> +// thread and the thread currently being sampled (the target).

You're inconsistent in your language -- sometimes you use "target", sometimes "samplee".

@@ +166,5 @@
>  }
>  
> +// A type used to coordinate between the sampler (signal sending)
> +// thread and the thread currently being sampled (the target).
> +struct SigHandlerCoordinator {

Nit: opening brace on its own line.

@@ +167,5 @@
>  
> +// A type used to coordinate between the sampler (signal sending)
> +// thread and the thread currently being sampled (the target).
> +struct SigHandlerCoordinator {
> +  sem_t      suspend_sem;    // samplee waits on this to suspend

Please use standard Mozilla style for field names, e.g. mSuspendSem, mThreadCtx. And no need to indent field names.

@@ +168,5 @@
> +// A type used to coordinate between the sampler (signal sending)
> +// thread and the thread currently being sampled (the target).
> +struct SigHandlerCoordinator {
> +  sem_t      suspend_sem;    // samplee waits on this to suspend
> +  sem_t      resume_sem;     // sampler waits on this for samplee

Nit: no need to indent field names.

@@ +174,5 @@
> +  mozilla::Atomic<int> busy; // set to 1 if the block is in use, else 0
> +};
> +
> +// This the one-and-only global variable used to communicate between
> +// the sampler thread and the samplee's signal handler.

"samplee's signal handler" -- is that really the case? I'm totally unclear on which thread the signal handler runs. I'd find this sentence clearer if you just removed "samplee". In fact, there are various places where I find "samplee" unclear, and "signal handler" would probably be clearer.

@@ +184,2 @@
>  static void
> +SigprofHandler(int signal, siginfo_t* info, void* contextV)

While you're changing this function a lot, please rename these as aSignal, aInfo, aContextV. (Or just aContext? I'm not sure what the "V" is for.)

@@ +198,2 @@
>  
> +  // Tell the sampler thread that the context is valid, and go to

I find this sentence easier to read if you change "and" to "then". (Makes it clearer that "go to sleep" isn't part of the message to the sampler thread.)

@@ +198,3 @@
>  
> +  // Tell the sampler thread that the context is valid, and go to
> +  // sleep.  sem_post can never fail, providing the semaphore is valid.

Nit: this and other comments look like they aren't 79 chars wide. You can fit more on each line :)

@@ +203,5 @@
> +
> +  // Wait for the sampler thread to un-freeze us.  In contrast to
> +  // sem_post, this can fail by being interrupted by a signal.
> +  while (true) {
> +    r = sem_wait(&(gSigHandlerCoordinator->resume_sem));

No need for the extra parens, here and on similar lines throughout the patch.

@@ +325,5 @@
>  
> +        // Suspend the samplee and get its context.
> +        SigHandlerCoordinator coord;  // on sampler thread's stack
> +        PodZero(&coord);
> +        coord.busy = 1;

Please give SigHandlerCoordinator a constructor to make the initialization nicer. I guess you'll have to initialize the fields individually instead of using PodZero, so you can remove the PodOperations.h include above.

@@ +331,5 @@
> +
> +        int r;
> +        r =  sem_init(&(gSigHandlerCoordinator->suspend_sem), /*!pshared*/0, 0);
> +        r |= sem_init(&(gSigHandlerCoordinator->resume_sem),  /*!pshared*/0, 0);
> +        MY_MOZ_RELEASE_ASSERT(r == 0);

The sem_init calls could be in the constructor too.

@@ +336,5 @@
> +
> +        r = tgkill(vm_tgid_, threadId, SIGPROF);
> +        MY_MOZ_RELEASE_ASSERT(r == 0);
> +
> +        // Wait for the signal handler to indicate that is has suspended.

Typo: s/is/it/

@@ +338,5 @@
> +        MY_MOZ_RELEASE_ASSERT(r == 0);
> +
> +        // Wait for the signal handler to indicate that is has suspended.
> +        r = sem_wait(&(gSigHandlerCoordinator->suspend_sem));
> +        MY_MOZ_RELEASE_ASSERT(r == 0);

For the other sem_wait you check for interruption by signals. Why not the same here?

@@ +345,5 @@
> +        // gSigHandlerCoordinator->thread_ctx is valid.  We can poke
> +        // around in it and unwind its stack as we like.
> +
> +        TickSample sample_obj;
> +        TickSample* sample = &sample_obj;

Since you've moving this code, can you rename sample_obj as sample and work with it directly instead of using a pointer? This indirection has been bugging me for a while and I have patches to remove it in the other platform-*.cpp files.

@@ +364,5 @@
> +        MY_MOZ_RELEASE_ASSERT(r == 0);
> +
> +        // At this point, we are racing the samplee.  The sem_post on
> +        // resume_sem just above makes the samplee logically able to
> +        // progress forwards, so that it will exit the signal handler.

This sentence is a good example where my confusion between "signal handler" and "samplee" make things hard to understand...

@@ +365,5 @@
> +
> +        // At this point, we are racing the samplee.  The sem_post on
> +        // resume_sem just above makes the samplee logically able to
> +        // progress forwards, so that it will exit the signal handler.
> +        // But when it resumes making to progress depends on when the

"making to progress"?
Attachment #8841554 - Flags: feedback+
>  AIUI, it's needed because you call sem_init() within the inner loop. What
> about if you moved it to the outer loop? I think you wouldn't need it then.
> Then it might be possible for SignalHandler's sem_wait(resume) call to still
> be incomplete by the time the next inner loop's iteration calls
> sem_post(resume), in which case it'll be interrupted and instead of looping
> the Signal handler could just return. (I'm not sure how the errno handling
> would work in this scenario.)

Sorry, I meant to delete that text. It's basically restating the second bullet point just above it. A third semaphore is probably still a better approach.
FWIW, here's what I wrote down about the three semaphores idea while I was getting my head around it:

- First semaphore: makes the sampler thread wait on the signal handler (so the latter can set up context)
- Second semaphore: makes the signal handler wait on the sampler thread (so the latter can Tick() using the context)
- Third semaphore: makes the sampler thread wait on the signal handler (so the latter can finish cleanly)

Also, even if this approach works out, it might be worth moving the sem_init() calls from the inner loop to the outer loop.
Thanks for the comments and review!  Yes, I later realised that
what is needed is a four-message protocol between the sampler
thread and the thread being sampled.  I missed the need for the
fourth message in my test program, hence the need for the spinlock
kludge.  I'll redo it properly today, removing the spinlock.
Attached patch bug1342102-2.cset (obsolete) — Splinter Review
Revised patch.  Addresses all review comments, has improved
comments, no spinlock, and three semaphores.
Attachment #8841554 - Attachment is obsolete: true
(In reply to Nicholas Nethercote [:njn] from comment #7)

Some comments:

----

I was indeed inconsistent in terminology for the sampling and signal-receiving
threads.  As it stands we don't appear to have decent names for:

(1) the thread that does the sampling
(2) the threads that are being sampled
(3) the signal handler, when it runs

For (2) we say clumsy and vague things like "the signal handler thread".  Note
that (2) and (3) are not the same thing.

I propose the following:

(1) "sampler thread"
(2) "samplee thread"
(3) "samplee thread's signal handler"

(1) and (2) apply equally to the -win32 and -macos variants.

In the revised patch I've used (1), (2) and (3) more consistently and removed
mentions of "target thread".

----

Your analysis about the need for the spinlock matches mine.  I agree that we
need a four-message protocol using a signal and three semaphores, and the
patch implements that.  It also has a big comment documenting the protocol.

You mentioned ..

> - Move the sem_init() calls from the inner loop to the outer loop.

I considered that, but I prefer not to do it, because:

* I have a hard time figuring out what the correctness implications would be.
  It's difficult enough making it not crash or hang as it is.

* Having the sampler's per-thread loop be able to "get ahead" of the threads
  it is sampling (ie, not wait for the signal handlers to exit) might make it
  more difficult to merge all three platform-* implementations together later.

----

> For the other sem_wait you check for interruption by signals. Why not the
> same here?

I made them all check for interruption by signals.

I find it difficult to figure out whether interruption by (other) signals
could cause deadlocking.  So far I haven't seen any problems.

I did notice that the sigaction() call that sets up the signal handler (in
PlatformStart), leaves the sa_mask field clear.  That means that the signal
handler could be interrupted by some other signal.  I'd prefer to set all
entries of the sa_mask field, in an attempt to block all other signals in
the handler, or at least as many as the kernel will let us.  That would be
an extra line of defence against the signal handler getting interrupted.

That can go in a different bug, though.
For a final version of the patch I will of course remove the following hack

#define MY_MOZ_RELEASE_ASSERT(_cond) MOZ_RELEASE_ASSERT(_cond)

and replace MY_MOZ_RELEASE_ASSERT uses with standard MOZ_ASSERTS.  The reason
it is there is that currently the profiler is unusable when built with
assertions, so I had to work with a non-debug build, but I still wanted to
have assertions.
Fixes some comment typos.  No functional change.
Attachment #8841938 - Attachment is obsolete: true
Attachment #8841945 - Flags: review?(n.nethercote)
(In reply to Julian Seward [:jseward] from comment #13)
> For a final version of the patch I will of course remove the following hack
> 
> #define MY_MOZ_RELEASE_ASSERT(_cond) MOZ_RELEASE_ASSERT(_cond)
> 
> and replace MY_MOZ_RELEASE_ASSERT uses with standard MOZ_ASSERTS.  The reason
> it is there is that currently the profiler is unusable when built with
> assertions, so I had to work with a non-debug build, but I still wanted to
> have assertions.

In general I prefer to review patches in their final form. One approach you could use in this case is to have two patches in your queue, the main one, and a follow-up that removes MY_MOZ_RELEASE_ASSERT, and post both for review, along with a promise to fold them together before landing.

Also, again, I'd be happier if I could see the commit message. Especially for a patch like this where it would ideally contain an explanation of why the change is worthwhile.
Comment on attachment 8841945 [details] [diff] [review]
bug1342102-3.cset

Review of attachment 8841945 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. I think you're still referring to the samplee's signal handler as the samplee in several places. Perhaps I'm being too pedantic? After all, the samplee's signal handler is acting on behalf of the samplee. I'll let you decide.

Other than that, a couple of nits, and otherwise looks good to go.

::: tools/profiler/core/platform-linux-android.cpp
@@ +165,5 @@
>  #endif
>  }
>  
> +// A four-message protocol is used to reliably suspend and later resume the
> +// thread to be sampled (the samplee):

This whole comment is good. But it would be nice to have a little more motivation. Specifically:
 
- The only way to reliably interrupt a Linux/Android thread and inspect its register state is via a signal handler.
 
- We don't want to run much code within signal handlers, because that's considered poor form, so we do the minimum we can: grab the register state and pass it back to the sampler thread, which does all the unwinding, etc.
 
That's the crux of it, AIUI.

@@ +167,5 @@
>  
> +// A four-message protocol is used to reliably suspend and later resume the
> +// thread to be sampled (the samplee):
> +//
> +// Sampler (signal sender) thread              Samplee (thread to be sampled)

The right-hand column is really a combination of the samplee and the samplee's signal handler, right?

@@ +174,5 @@
> +// and point gSigHandlerCoordinator at it
> +//
> +// send SIGPROF to samplee ------- MSG 1 ----> (enter signal handler)
> +// wait(mMessage2)                             Copy register state
> +//                                             into gSigHandlerCoordinator

Nit: where descriptions spill onto a 2nd or later line, I would indent those later lines by 2 spaces, to aid readability.

@@ +183,5 @@
> +//
> +// Release samplee:
> +// post(mMessage3)         ------- MSG 3 ----->
> +// wait(mMessage4)                              Samplee now resumes.  Tell
> +//                                              the sampler that we are done.

More accurate to say "Samplee can now resume"? It doesn't actually resume until SigprofHandler() returns, right?

@@ +186,5 @@
> +// wait(mMessage4)                              Samplee now resumes.  Tell
> +//                                              the sampler that we are done.
> +//                         <------ MSG 4 ------ post(mMessage4)
> +// Now we know the samplee has
> +// finished using gSigHandlerCoordinator.

The samplee's signal handler has finished using it.

@@ +206,5 @@
> +{
> +  SigHandlerCoordinator()
> +  {
> +    PodZero(&mUContext);
> +    int r =  sem_init(&mMessage2, /*!pshared*/0, 0);

Nit: I would write:

> int r = sem_init(&mMessage2, /* pshared */0, 0);

i.e. spaces around the comment markers, and no '!'.

@@ +222,5 @@
> +  }
> +
> +  sem_t mMessage2; // "to sampler: context is in gSigHandlerCoordinator"
> +  sem_t mMessage3; // "to samplee: resume"
> +  sem_t mMessage4; // "to sampler: finished with gSigHandlerCoordinator"

Nit: put the opening quotes after the colons?

@@ +253,5 @@
>  
> +  // At this point, the sampler thread assumes we are suspended, so we must
> +  // not touch any global state here.
> +
> +  // Wait for message 3: the sampler thread tells us to resume

Nit: add trailing '.'.

@@ +387,5 @@
> +        // signalling at it.
> +        int r = tgkill(vm_tgid_, threadId, SIGPROF);
> +        MY_MOZ_RELEASE_ASSERT(r == 0);
> +
> +        // Wait for message 2 from the samplee, indicating that the context is

The message is coming from the samplee's signal handler.

@@ +417,5 @@
> +        sample.ussMemory = ussMemory;
> +
> +        Tick(&sample);
> +
> +        // Send message 3 to the samplee, which tells it to resume.

Message goes the samplee's signal handler.

@@ +421,5 @@
> +        // Send message 3 to the samplee, which tells it to resume.
> +        r = sem_post(&gSigHandlerCoordinator->mMessage3);
> +        MY_MOZ_RELEASE_ASSERT(r == 0);
> +
> +        // Wait for message 4 from the samplee, which tells us that it has

Message comes from the samplee's signal handler.
Attachment #8841945 - Flags: review?(n.nethercote) → review+
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e1880fc4d8f
Use the same threading structure in platform-linux-android.cpp as for the -macos and -win32 versions.  r=n.nethercote.
https://hg.mozilla.org/mozilla-central/rev/1e1880fc4d8f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.