Open Bug 1339756 Opened 7 years ago Updated 2 years ago

Enlist compiler support to help delimit profiler critical-section code

Categories

(Core :: Gecko Profiler, task, P3)

task

Tracking

()

Tracking Status
firefox54 --- affected

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(1 file)

On all platforms, the Gecko profiler core contains critical sections
inside which we must be very careful what we do, else we risk
deadlock.  They are:

* Mac: platform-macos.cc, SampleContext(): the section between
  thread_suspend() and thread_resume()

* Windows: platform-win32.cc, SampleContext(): the section between
  SuspendThread() and ResumeThread() at the end

* Linux: platform-linux.cc: exactly the whole of
  ProfilerSignalHandler()

In all cases, we suspend the thread to be sampled at some completely
arbitrary point.  That means that it may hold arbitrary exclusive
resources, particularly locks, and most particularly the malloc lock.
Therefore, if the critical section attempts to acquire any of these
resources, and most particularly attempts any dynamic memory
allocation, it may deadlock.

Currently there is no automated support for making this safe.  We rely
only on programmer diligence.  It would be nice if we could get at
least partial assistance from the compiler or the static analysis
builds for this.  At the bare minimum we should clearly comment all
functions reachable from a critical section.

Some functions in tools/profiler/lul/*.cpp are already marked with
"RUNS IN NO-MALLOC CONTEXT", which is at least a starting point.
WIP patch; incomplete; won't compile; a mess.  Essentially
abandoned for the moment.

The idea was to have a class CSA (Critical Section Auditor) which
merely wraps up a boolean, indicating whether we're currently in
a critical section or not.  It should be created at the start of
the critical section and destroyed at the end.

Then, inside the critical section, each method/procedure/function
call is changed as follows:

(1) Either a ref to the CSA is passed as a new first argument, or

(2) the call is wrapped in an AUDIT macro which takes the CSA ref
    as a first arg.  AUDIT performs the stated call but first checks
    the CSA to see if we're in a critical section or not, and complains
    if so.

You can think of (1) as expanding the perimeter of the audited area,
and (2) as a runtime escape through the boundary into an "unchecked"
area, which elicits a warning message.

I abandoned this, at least in its current form, because:

(1) having to edit every function call is majorly intrusive, time
    consuming and error-prone

(2) because it's done by hand, we get no automated assistance for
    determining if we missed any calls,

(3) because it's more complex than the summary (1)/(2) above implies.
    In particular, there are methods which are called from within a
    critical section, but they are also called from outside such
    sections.  This is why the CSA class contains a boolean and only
    warns about perimeter crossings when that boolean is set.  The
    idea there is that, for a call from outside a critical section into
    the audited area requires creating a dummy CSA that says "no we're
    not currently in a critical section."

(4) It became obvious pretty fast that the critical sections are large,
    encompassing much of the code in tools/profile/core/*.{cpp,h} and
    marking it up would be a lot of work.

(5) It occurred to me later that it would be easy to build a Valgrind
    tool that could observe all allocations and lock/unlock attempts
    inside a critical section, with none of the attendant hassle (1) .. (4).
All that said, it bugs me that we can't do at least some work at
the C++ level that helps in critical-section auditing.
Suggestions welcomed.
(In reply to Julian Seward [:jseward] from comment #1)
> (5) It occurred to me later that it would be easy to build a Valgrind
>     tool that could observe all allocations and lock/unlock attempts
>     inside a critical section, with none of the attendant hassle [..]

This is bug 1341239.
Severity: normal → N/A
Type: defect → task
Priority: -- → P3

Sorry, there was a problem with the detection of inactive users. I'm reverting the change.

Assignee: nobody → jseward
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: