Open Bug 1278588 Opened 8 years ago Updated 2 years ago

ASSERTION: Potential deadlock detected (ObservedDocShell::ClearMarkers(), locking mOffTheMainThreadTimelineMarkers)

Categories

(Core :: DOM: Navigation, defect, P3)

defect

Tracking

()

Tracking Status
firefox50 --- affected

People

(Reporter: MatsPalmgren_bugz, Unassigned)

References

Details

(Whiteboard: btpp-followup-2016-06-14)

Per comment bug 1258079 comment 3:
(In reply to Randell Jesup [:jesup])
> Matt (sic) - that's a totally different bug; it's deadlocking in
> ObservedDocShell::ClearMarkers(), locking mOffTheMainThreadTimelineMarkers. 
> Please file a bug in that component

Log snippet with stack:
https://bug1258079.bmoattachments.org/attachment.cgi?id=8736401

(note that it happened a while ago, on 2016-03-30)
Flags: needinfo?(ttromey)
Whiteboard: btpp-followup-2016-06-14
The last time a deadlock came up here was:

https://bugzilla.mozilla.org/show_bug.cgi?id=1253516

IIRC during that investigation :fitzgen said that other code paths looked questionable
from a locking perspective.

I'll try to take a look at it soon.
I'm going to try to solve this one in bug 1283887; but if not I'll handle it here.
Assignee: nobody → ttromey
Status: NEW → ASSIGNED
Flags: needinfo?(ttromey)
I haven't been able to reproduce this.

I thought it might be worthwhile to spell out what I did, for double-checking.

After looking at the stack trace and reading the informative comment in
xpcom/glue/DeadlockDetector.h, my understanding is that the bug is potential
deadlock due to inconsistent lock ordering.  In particular, sometimes the
order is first sMutex and then the observed docshell mutex; and sometimes the
reverse.

First, I did a debug build, to enable the deadlock detection.
I made sure the deadlock detection was working by running mach cppunittest
and looking for the deadlock detection test.
Then, I ran the dom/animation/test/chrome/test_restyles.html.
I ran it many times (e.g., with --repeat 30 and with --repeat 100) and both with
and without chaos mode (using "MOZ_CHAOSMODE=0").
This never showed the warning in question.

Since the original log doesn't include the stack traces of the offending mutex
acquisitions, it's hard to know the "reversed" spot.  One suspect is the
deadlock found in bug 1253516.

Mats - can you still reproduce and if so, how?
Flags: needinfo?(mats)
No, I haven't seen this assertion since that one occurrence.
Flags: needinfo?(mats)
We've started to seeing this assertion again only on Mac OSX 10.10.  See bug 1244897 comment 18.  Tom, any suspicions around the starting day ?
Flags: needinfo?(ttromey)
See Also: → 1244897
It seems to me that the bug could occur if ObservedDocShell::PopMarkers causes a GC.
At this point, the ObservedDocShellMutex would be held.

Then the GC can call CycleCollectedJSContext::GCNurseryCollectionCallback to add a marker.
This will call TimelineConsumers::AddMarkerForAllObservedDocShells, locking sMutex.

However, many other calls lock sMutex first before acquiring ObservedDocShellMutex.

This is just a theory because the logs only have stack traces for one of the mutex acquisition paths.

There is some code in ObservedDocShell to avoid pushing markers that were created
during the course of popping.  One idea for a fix might be to move this logic to
TimelineConsumers.

Is there a reliable way to reproduce this bug?
Flags: needinfo?(ttromey) → needinfo?(hiikezoe)
(In reply to Tom Tromey :tromey from comment #6)

> Is there a reliable way to reproduce this bug?

No, I've never seen this assertion locally.  I thought this assertion has been fixed because we hadn't seen the failure (bug 1244897) for six months.
Flags: needinfo?(hiikezoe)
Blocks: 1244897
as this is is blocking bug 1244897 which is one of our higher frequency intermittents, I would like to understand the next steps of this bug.
(In reply to Joel Maher ( :jmaher) from comment #8)
> as this is is blocking bug 1244897 which is one of our higher frequency
> intermittents, I would like to understand the next steps of this bug.

It's been on the back burner both due to lack of time and also my perception that the
bug was very intermittent and not reproducible.  If it's more intermittent then I
can make it a higher priority.
I just realized that there is the possibility that bug 1305325 causes this dead lock.  And I guess it will be fixed by bug 1324966.
Blocks: 1332970
Priority: -- → P3
Docshell markers are going to be removed in bug 1421651.
Maybe I should close this as wontfix, but meanwhile marking the dependency
and dropping it.
Assignee: ttromey → nobody
Status: ASSIGNED → NEW
Depends on: 1421651
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.