Closed Bug 1382928 Opened 7 years ago Closed 7 years ago

Intermittent deadlocks in GeckoProfiler gtest

Categories

(Core :: Gecko Profiler, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: mstange, Assigned: mstange)

Details

Attachments

(2 files, 1 obsolete file)

The test is using the pattern

thread->Dispatch(runnable, NS_DISPATCH_SYNC)

in order to run code synchronously on another thread.

This spins the event loop on the main thread while waiting for the runnable to complete. While I hadn't planned for this, there's also not a good reason why spinning the event loop should break anything, and in this case I think it has uncovered a real issue.

The deadlock is a circular dependency between CheckResponsivenessTask::mMutex, the timer thread's thread startup mutex, and the profiler state lock.

The main thread is running a CheckResponsivenessTask. That task is trying to launch a timer thread while holding mMutex.
The separate thread that the test launches is currently executing profiler_stop, which is holding the profiler state lock. Inside profiler_lock, locked_profiler_stop is calling ThreadInfo::StopProfiling() which calls CheckResponsivenessTask::Terminate, which wants to grab the CheckResponsivenessTask's mMutex, which is currently being held by the main thread.
In order for the main thread to make progress, the timer thread needs to finish launching. But the timer thread is currently trying to register itself with the profiler, and doing that is blocked on the profiler state lock.
Yikes! Nice diagnosis. I will be interested to see what the fix looks like :)
Attached file full backtrace
I think making mStop an atomic should be all that's needed. That's the only variable that's protected by the mutex.
Comment on attachment 8888948 [details]
Bug 1382928 - Make CheckResponsivenessEvent::mStop an atomic.

https://reviewboard.mozilla.org/r/159968/#review165426

I've never got my head around this file. Are you sure mMutex only protects mStop? What about mTimer and mLastTracerTime?

::: tools/profiler/gecko/ThreadResponsiveness.cpp:93
(Diff revision 1)
>  
>    NS_DECL_ISUPPORTS_INHERITED
>  
>  private:
>    TimeStamp mLastTracerTime;
>    Mutex mMutex;

If this lands as is, you can remove:
- mMutex;
- the #include "mozilla/Mutex.h";
- the "using" lines involving Mutex.
Attachment #8888948 - Flags: review?(n.nethercote)
Apologies for the slow review; I did it 3 days ago but apparently forgot to hit submit :(
Attachment #8888948 - Attachment is obsolete: true
(In reply to Nicholas Nethercote [:njn] from comment #5)
> I've never got my head around this file. Are you sure mMutex only protects
> mStop? What about mTimer and mLastTracerTime?

mTimer is only accessed on the main thread, and mLastTracerTime is explicitly documented as being unprotected. So in that way, the patch I had here didn't make things worse.

But that's not a great state of affairs. So I've now gone ahead and tried to make mLastTracerTime an atomic as well, so that it's not subject to data races any more. Unfortunately, you can't store TimeStamp in an atomic, and you can't store a double in an atomic either, so now I'm storing uint64_t(profiler_time() * 1000.0). Microseconds should be accurate enough, and 2^64us ≈ 585000 years, so the range should be acceptable, too.
Comment on attachment 8892269 [details]
Bug 1382928 - Use atomics for thread-shared pieces of state in ThreadResponsiveness.cpp.

https://reviewboard.mozilla.org/r/163206/#review168576

I have numerous comments about names below. Each one reflects my train of thought as I was looking at each piece of code. Having now finished the review, I suggest these renamings:

aNow --> aStartToNow_ms
mLastTracerTime --> aStartToPrevTracer_us (or aStartToPrevResp_us?)
GetLastTracerTime() --> GetStartToLastTracer_ms (or "_us" suffix?)

A bit clunky, but the AToB form makes the start and end points clear, and the suffix makes the units clear. What do you think?

r=me with some improvement of the names.

::: tools/profiler/gecko/ThreadResponsiveness.h:27
(Diff revision 2)
>  
>    void Update();
>  
> -  mozilla::TimeDuration GetUnresponsiveDuration(const mozilla::TimeStamp& now) const {
> -    return now - mLastTracerTime;
> +  // The number of milliseconds that elapsed since the last
> +  // CheckResponsivenessTask was processed.
> +  double GetUnresponsiveDuration(double aNow) const {

It's confusing that the call to this function above has the expression "aNow - CorePS::ProcessStartTime()", but this argument is also called "aNow". Maybe "aNowDuration"? Or "aMSSinceProcessStart"? Not sure, but it helps if the differing units are clear in the name.

::: tools/profiler/gecko/ThreadResponsiveness.h:28
(Diff revision 2)
>    void Update();
>  
> -  mozilla::TimeDuration GetUnresponsiveDuration(const mozilla::TimeStamp& now) const {
> -    return now - mLastTracerTime;
> +  // The number of milliseconds that elapsed since the last
> +  // CheckResponsivenessTask was processed.
> +  double GetUnresponsiveDuration(double aNow) const {
> +    return aNow - *mLastTracerTime;

I also have a pet peeve with "Last" in variable names -- does it mean "previous" or "final"? :)  "Prev" or "Previous" would be better here, I think.

::: tools/profiler/gecko/ThreadResponsiveness.cpp:83
(Diff revision 2)
>      mStop = true;
>    }
>  
> -  const TimeStamp& GetLastTracerTime() const {
> -    return mLastTracerTime;
> +  // Can be called on any thread.
> +  double GetLastTracerTime() const {
> +    return mLastTracerTime / 1000.0;

Are you sure you want to return milliseconds here?

::: tools/profiler/gecko/ThreadResponsiveness.cpp:128
(Diff revision 2)
>  
>  void
>  ThreadResponsiveness::Update()
>  {
>    mActiveTracerEvent->DoFirstDispatchIfNeeded();
> -  mLastTracerTime = mActiveTracerEvent->GetLastTracerTime();
> +  mLastTracerTime = Some(mActiveTracerEvent->GetLastTracerTime());

Is this a unit error? mLastTracerTime is measured in microseconds, but GetLastTracerTime() returns milliseconds, as far as I can tell.

I suggest adding suffixes ("_ms" and "_us" or "MS" and "US") to these time identifiers to make the units clearer.
Attachment #8892269 - Flags: review?(n.nethercote) → review+
I like your name suggestions. Thanks!
Comment on attachment 8892269 [details]
Bug 1382928 - Use atomics for thread-shared pieces of state in ThreadResponsiveness.cpp.

https://reviewboard.mozilla.org/r/163206/#review168576

> Is this a unit error? mLastTracerTime is measured in microseconds, but GetLastTracerTime() returns milliseconds, as far as I can tell.
> 
> I suggest adding suffixes ("_ms" and "_us" or "MS" and "US") to these time identifiers to make the units clearer.

This was not an error, but it was extremely confusing: We have two members called mLastTracerTime, one is on CheckResponsivenessEvent, and one on ThreadResponsiveness. And their units are different.

Now with the suffix that's hopefully less confusing.
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/3b7945df033d
Use atomics for thread-shared pieces of state in ThreadResponsiveness.cpp. r=njn
> We have two members
> called mLastTracerTime, one is on CheckResponsivenessEvent, and one on
> ThreadResponsiveness. And their units are different.

lol
https://hg.mozilla.org/mozilla-central/rev/3b7945df033d
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: