Deadlock in TimelineConsumers::IsEmpty

RESOLVED DUPLICATE of bug 1253516

Status

()

Firefox
Developer Tools: Performance Tools (Profiler/Timeline)
P1
normal
RESOLVED DUPLICATE of bug 1253516
2 years ago
2 years ago

People

(Reporter: dx, Assigned: tromey)

Tracking

47 Branch
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
Created attachment 8767247 [details]
backtraces.txt

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0
Build ID: 20160614130443

Steps to reproduce:

Unknown. Happened while idling.


Actual results:

Main thread stopped responding to events, with 0% cpu usage. Left it like that for two hours.


Expected results:

I expected an additional information textbox in here, but there isn't one, so i'll use this.

Arch linux x86_64, firefox 47 release from mozilla, got debug symbols.

I haven't crashed firefox with SIGABRT yet, it's still running. Since I have no idea what caused this or how to reproduce it, I'd like to keep it running for an extra while (a few hours?) and run any gdb commands that might help. At some point i'd like to use this firefox and i'll SIGABRT and link the report here.

I also generated a 178 gigabyte core file. I suspect that stuff doesn't get uploaded with the bug reporter. I'll have to delete that file at some point.

Snippet of backtrace of main thread (all threads in attachment)

Thread 1 (Thread 0x7f3371356740 (LWP 17456)):
#0  0x00007f3370f80eac in __lll_lock_wait () from /usr/lib/libpthread.so.0
#1  0x00007f3370f7ac66 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
#2  0x00007f336fadcf99 in PR_Lock (lock=0x7f3350367740) at /builds/slave/m-rel-l64-00000000000000000000/build/src/nsprpub/pr/src/pthreads/ptsynch.c:177
#3  0x00007f336116ea58 in mozilla::BaseAutoLock<mozilla::StaticMutex>::BaseAutoLock (aLock=..., this=0x7ffe6ccb32a0) at /builds/slave/m-rel-l64-00000000000000000000/build/src/obj-firefox/dist/include/mozilla/Mutex.h:164
#4  mozilla::TimelineConsumers::IsEmpty (this=this@entry=0x7f335036b460) at /builds/slave/m-rel-l64-00000000000000000000/build/src/docshell/base/timeline/TimelineConsumers.cpp:172
#5  0x00007f335fa85f7e in mozilla::CycleCollectedJSRuntime::GCNurseryCollectionCallback (aRuntime=0x7f335b764000, aProgress=JS::GCNurseryProgress::GC_NURSERY_COLLECTION_START, aReason=JS::gcreason::OUT_OF_NURSERY) at /builds/slave/m-rel-l64-00000000000000000000/build/src/xpcom/base/CycleCollectedJSRuntime.cpp:852
#6  0x00007f335fa182f2 in js::gcstats::Statistics::beginNurseryCollection (reason=JS::gcreason::OUT_OF_NURSERY, this=0x7f335b7645a0) at /builds/slave/m-rel-l64-00000000000000000000/build/src/js/src/gc/Statistics.h:199
#7  js::Nursery::collect (this=this@entry=0x7f335b764450, rt=0x7f335b764000, reason=reason@entry=JS::gcreason::OUT_OF_NURSERY, pretenureGroups=pretenureGroups@entry=0x7ffe6ccb3820) at /builds/slave/m-rel-l64-00000000000000000000/build/src/js/src/gc/Nursery.cpp:402
(Reporter)

Comment 1

2 years ago
The build ID in the description is wrong, that's the one i have in my laptop as built by my distro. The one that deadlocked is build ID 20160604131506 as built by mozilla. Both are firefox 47.0.
(Reporter)

Comment 2

2 years ago
Oh hey I think I figured it out (or, rather, figured out as much as I could without being familiar with the mozilla code). It's right in the backtrace of the main thread, frames #4 and #24.

(gdb) frame 4
#4  mozilla::TimelineConsumers::IsEmpty (this=this@entry=0x7f335036b460) at /builds/slave/m-rel-l64-00000000000000000000/build/src/docshell/base/timeline/TimelineConsumers.cpp:172
172       StaticMutexAutoLock lock(sMutex); // for `mActiveConsumers`.

(gdb) frame 24
#24 0x00007f336116e8b1 in mozilla::TimelineConsumers::PopMarkers (this=this@entry=0x7f335036b460, aDocShell=aDocShell@entry=0x7f321d2b9000, aCx=aCx@entry=0x7f335d79b000, aStore=...)
    at /builds/slave/m-rel-l64-00000000000000000000/build/src/docshell/base/timeline/TimelineConsumers.cpp:309
309       aDocShell->mObserved->PopMarkers(aCx, aStore);

(gdb) l
303       StaticMutexAutoLock lock(sMutex); // for `mMarkersStores`.
304
305       if (!aDocShell || !aDocShell->mObserved) {
306         return;
307       }
308
309       aDocShell->mObserved->PopMarkers(aCx, aStore);
310     }
311


That's mozilla::TimelineConsumers::sMutex twice.
(Reporter)

Comment 3

2 years ago
Well, uh, this isn't a SIGABRT kill crash report, but maybe it helps:

https://crash-stats.mozilla.com/report/index/533eecfc-62a3-4f69-9e58-6b9e22160702

Quoting my own user comment:

>This one might be noise. I had a deadlock (bug #1283887), I did pthread_mutex_unlock(&lock->mutex)
>from gdb, it recovered and started playing a webm, then segfaulted seconds later :(

I expected the crash report to get more details, but I guess I lost those by having gdb catch it. In case it helps, I'm attaching gdb-unlock-then-crash.txt with the commands I issued and the backtrace of the crash.
(Reporter)

Comment 4

2 years ago
Created attachment 8767381 [details]
gdb log of my naive attempt to unlock the mutex, followed by a segfault
Hello Nick, could you please take a look at this issue?
Component: Untriaged → Developer Tools: Performance Tools (Profiler/Timeline)
Flags: needinfo?(nfitzgerald)
I don't have any context for the C++ side of things for triaging. Nick, could you set the priority on this?
This is pretty high priority: crashes and deadlocks aren't just a small corner feature being buggy, they kill the whole browser functionality (ignoring e10s content process crashes, which are recoverable, but still). For that reason, I think this should be P1.

Ideally the author of the code would be the one to fix it, however Victor has moved on. I'm currently a bit swamped, Tromey do you think you could take this? I think you've touched the markers code at least as much or more than I have.

It seems like all the marker functions shouldn't take the locks themselves, but should instead have `AutoLock&` parameters that get threaded through all the calls and then the very top level entry point should be the only place we take the lock. That pattern would have prevented a bug like this, AFAICT.
Flags: needinfo?(nfitzgerald) → needinfo?(ttromey)
Also, dx: thanks for the very thorough bug report and investigation! It helps a lot :)
Priority: -- → P1
(Assignee)

Comment 9

2 years ago
Taking.
Assignee: nobody → ttromey
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(ttromey)
(Assignee)

Comment 10

2 years ago
I think this one is actually just a dup of bug 1253516.
The analysis in comment #2 shows the deadlock occurring because
TimelineConsumers::PopMarkers acquires sMutex; but that was
removed in the aforementioned bug.
This is consistent with the original report which shows the bug occurring
in 47; the fix is in 48.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1253516
You need to log in before you can comment on or make changes to this bug.