Closed Bug 1281295 Opened 8 years ago Closed 8 years ago

TSan: data race writing to libnestegg's hlist_null, harmless so it could be suppressed to avoid future confusion

Categories

(Core :: Audio/Video: Playback, defect, P5)

47 Branch
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: mozbugz, Assigned: kinetik)

References

Details

Attachments

(1 obsolete file)

As seen in: https://treeherder.mozilla.org/logviewer.html#?job_id=22651204&repo=try#L3842 Excerpts: >Write of size 8 at 0x7f9f2ede6ba0 by main thread: > #0 hlist_del media/libnestegg/src/hlist.h:116:13 > #7 mozilla::WebMDemuxer::~WebMDemuxer() dom/media/webm/WebMDemuxer.cpp:179:1 > #14 mozilla::TrackBuffersManager::ShutdownDemuxers() dom/media/mediasource/TrackBuffersManager.cpp:784 > #15 mozilla::TrackBuffersManager::~TrackBuffersManager() dom/media/mediasource/TrackBuffersManager.cpp:115:3 > #21 mozilla::dom::SourceBuffer::~SourceBuffer() dom/media/mediasource/SourceBuffer.cpp:325 > #22 mozilla::dom::SourceBuffer::~SourceBuffer() dom/media/mediasource/SourceBuffer.cpp:321:1 > #23 mozilla::DOMEventTargetHelper::DeleteCycleCollectable() dom/events/DOMEventTargetHelper.cpp:85:1 > #24 mozilla::DOMEventTargetHelper::cycleCollection::DeleteCycleCollectable(void*) obj-firefox/dist/include/mozilla/DOMEventTargetHelper.h:65:3 > #25 SnowWhiteKiller::~SnowWhiteKiller() xpcom/base/nsCycleCollector.cpp:2685:9 >Previous write of size 8 at 0x7f9f2ede6ba0 by thread T57: > #0 hlist_del media/libnestegg/src/hlist.h:116:13 > #7 mozilla::WebMDemuxer::NestEggContext::Init() dom/media/webm/WebMDemuxer.cpp:151:10 > #8 mozilla::WebMDemuxer::ReadMetadata() dom/media/webm/WebMDemuxer.cpp:268:11 > #9 mozilla::WebMDemuxer::Init() dom/media/webm/WebMDemuxer.cpp:189:7 > #10 mozilla::TrackBuffersManager::ResetDemuxingState() dom/media/mediasource/TrackBuffersManager.cpp:826:29 > #11 mozilla::TrackBuffersManager::SegmentParserLoop() dom/media/mediasource/TrackBuffersManager.cpp:689:11 I think the important part here is ShutdownDemuxers() being called on the main thread from ~TrackBuffersManager(). ShutdownDemuxers() resets a few RefPtr's that should only be accessed on the assigned TrackBuffersManager's task queue. As is evidenced in the above stacks, WebMDemuxer (referenced through TrackBuffersManager::mInputDemuxer) is being destroyed on the main thread while still being used on the TrackBuffersManager's task queue. Possibly solutions: A. ~TrackBuffersManager() should sync-dispatch ShutdownDemuxers on the task queue (and we should sprinkle some task queue assertions in all methods). B. Users of mInputDemuxer (and other RefPtr's that ShutdownDemuxers may reset) should take an extra reference locally, so that these objects may not be destroyed while being used. A seems better to me, but maybe there's a risk of shutdown-hang? B is heavier but avoids a sync-dispatch; however it may present other risks depending on what actions actually happen, e.g. maybe they call back into the TrackBuffersManager which is in the middle of being destroyed? Jean-Yves and JW, you know this code better; comments?
As I noted in bug 1273769 comment 6, task queue runnables usually keep a reference to the TrackBuffersManager, so I'm not sure how it can get released in the main thread while there is still a task waiting in the queue or being run. So the problem&solution may not be as obvious as I first thought.
Is the underlying heap (halloc, h_free) used by libnestegg thread-safe? I don't see any use of locks.
(In reply to JW Wang [:jwwang] from comment #2) > Is the underlying heap (halloc, h_free) used by libnestegg thread-safe? I > don't see any use of locks. it shouldn't matter. it's only ever called on a single thread, and the use of taskqueue should make it that it's only ever access on the taskqueue. that the destructor is called while a task is currently pending indicate something wrong in the taskqueue: the destructor shouldn't have been called while the task/taskqueue is holding a strong reference on the TrackBuffersManager. Weither libnestegg is thread safe or not is irrelevant (none of the demuxers are thread-safe). it should never be used across multiple threads at once per design. but obviously something isn't working as intended.
(In reply to JW Wang [:jwwang] from comment #2) > Is the underlying heap (halloc, h_free) used by libnestegg thread-safe? I > don't see any use of locks. Just wanted to confirm that libnestegg is *not* thread-safe when accessing the same context. But it should be fine to access separate contexts from different threads, as there is no shared data between them. So as the log and Jean-Yves suggest, something has gone wrong, that allows simultaneous access to the same data: - From a running task in thread T57, - From ~TrackBuffersManager() in the main thread, due to ~SourceBuffer presumably dropping the last reference to its tracks -- But this is unexpected, because the running task in T57 should still hold a reference! I'm trying to go through all "TrackBuffersManager*", to see if anyone of these non-owning pointers could be the problem. And also trying to verify that the TaskQueue implementation is solid and doesn't allow simultaneous runs. But so far so good. Just a nit: (In reply to Jean-Yves Avenard (On PTO until July 10) [:jya] from comment #3) > [libnestegg]'s only ever accessed on the taskqueue. I don't think that's strictly correct. Instead, the thread-safe ref-counting of TrackBuffersManager, along with RefPtr's that *should* reference it from all TaskQueue runnables, should ensure that ~TrackBuffersManager will only run *after* all tasks have run, in which case it is safe to run in any thread, as nothing else should touch this data anymore.
If the destructor is not called on the task queue (e.g. from the main thread instead), ShutdownDemuxers() is sync-dispatched to the task queue, to ensure that it cannot run in parallel with other tasks and cause data races or UAF. This is wall-papering over the issue that the destructor is sometimes called even though a task is actually running or is queued, which should own the TrackBuffersManager and therefore prevent its destruction! However the cause hasn't been found yet, so in the meantime this patch should help reduce the crash rate for our users. Review commit: https://reviewboard.mozilla.org/r/61158/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/61158/
Attachment #8766127 - Flags: review?(jwwang)
Is it possible to use NS_IMPL_RELEASE_WITH_DESTROY to customize the destroy function to run the destructor on the task queue thread? https://hg.mozilla.org/mozilla-central/file/e45890951ce77c3df05575bd54072b9f300d77b0/dom/media/MediaTimer.cpp#l20
Comment on attachment 8766127 [details] Bug 1281295 - Ensure TrackBuffersManager work happens on task queue - https://reviewboard.mozilla.org/r/61158/#review58114
Attachment #8766127 - Flags: review?(jwwang)
Comment on attachment 8766127 [details] Bug 1281295 - Ensure TrackBuffersManager work happens on task queue - https://reviewboard.mozilla.org/r/61158/#review58300 jwwang: "Is it possible to use NS_IMPL_RELEASE_WITH_DESTROY to customize the destroy function to run the destructor on the task queue thread?" Interesting idea, but: It would be a bit of work to implement. And most importantly NS_IMPL_RELEASE_WITH_DESTROY&friends assume that the thread where the object is created is the "owning thread", where destruction should also take place (which usually makes sense). But TrackBuffersManager is created on the main thread by SourceBuffer: https://hg.mozilla.org/mozilla-central/annotate/b69a5bbb5e40bd426e35222baa600b481e50d265/dom/media/mediasource/MediaSource.cpp#l411 So we would in fact force this situation where TBM is destroyed on the main thread, without protecting against the simulataneous access from the task queue. As this is wall-papering against the race between the main thread destruction and some work still (wrongly) happening on the task queue, I think the current patch is the safest fix. And hopefully it's only temporary until we can get the proper fix in place. Setting r? again, for you to re-review in light of this.
Comment on attachment 8766127 [details] Bug 1281295 - Ensure TrackBuffersManager work happens on task queue - Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61158/diff/1-2/
Attachment #8766127 - Flags: review?(gsquelart) → review?(jwwang)
Assignee: nobody → gsquelart
Comment on attachment 8766127 [details] Bug 1281295 - Ensure TrackBuffersManager work happens on task queue - https://reviewboard.mozilla.org/r/61158/#review58318 ::: dom/media/mediasource/TrackBuffersManager.cpp:119 (Diff revision 2) > { > + if (OnTaskQueue()) { > - ShutdownDemuxers(); > + ShutdownDemuxers(); > + } else { > + nsCOMPtr<nsIRunnable> runnable = > + NewRunnableMethod(this, &TrackBuffersManager::ShutdownDemuxers); Will that increase the ref-count to 1 and then drop to 0 and cause double-free?
Attachment #8766127 - Flags: review?(jwwang)
https://reviewboard.mozilla.org/r/61158/#review58318 > Will that increase the ref-count to 1 and then drop to 0 and cause double-free? Yeah, sounds like it would happen. You just can't win! Strange that it didn't show up in tries. Anyway, I'll see what I can do...
Alright, after further consideration and a better look at it, I think this data race is a in fact a harmless positive due to libnestegg's hlist, it has nothing to do with TrackBuffersManager! Here is another TSan report that better shows the problem: http://archive.mozilla.org/pub/firefox/try-builds/sendchange-unittest-b6238efdea7c63e6c5ea1d6962ceeab0fe26eae7/try-linux64-tsan/try_ubuntu64_vm_tsan_test-mochitest-media-bm117-tests1-linux64-build0.txt.gz > Write of size 8 at 0x7f572e9e6c88 by thread T63: > #0 hlist_del media/libnestegg/src/hlist.h:116:13 > #1 halloc media/libnestegg/src/halloc.c:108 > ... > Previous write of size 8 at 0x7f572e9e6c88 by thread T18: > #0 hlist_del media/libnestegg/src/hlist.h:116:13 > #1 halloc media/libnestegg/src/halloc.c:108 > ... > Location is global 'hlist_null' of size 16 at 0x7f572e9e6c80 (libxul.so+0x00000659ec88) The last line is the clue I previously missed! 'hlist_null' is a global sentinel value. All hlists finish by pointing to it, and sometimes even modify its contents (e.g.: in hlist_del when it does 'next->prev = i->prev;'), but otherwise never actually do anything real with it. It's the unchecked writing into the sentinel that triggered the data race. But since whatever is written should never get read, it actually doesn't matter. Is there a way to tell TSan about that? On the way, another potential data race is in halloc(), where a global 'halloc_allocator' is set the first time it is checked and is null. If halloc() was first called from two threads, it could theoretically see a null in both threads, and then write the default allocator from both threads; but since there is one default allocator, it *shouldn't* matter that it's written twice. Now there is an 'assert(! allocator);' in 'halloc_set_allocator', which could trigger in this case! So we may want to do a proper thread-safety check, or alternatively force this initial setting by calling it early in the Firefox lifecycle, before multi-thread calls can happen. Matthew, you know more about libnestegg, would you mind dealing with this?
Assignee: gsquelart → nobody
Flags: needinfo?(kinetik)
Summary: TSan: data race involving ~TrackBuffersManager() -> ShutdownDemuxers() on main thread → TSan: data race writing to libnestegg's hlist_null or halloc_allocator
Attachment #8766127 - Attachment is obsolete: true
Ah good find
I'm going to mark this as a P2 based on Gerald's assessment that this is a harmless race.
Priority: -- → P2
halloc is a third party library that nestegg uses, so if we change halloc itself we should try to upstream the changes. I'll take a look at this sometime, but it's pretty far down my queue. (In reply to Gerald Squelart [:gerald] from comment #12) > But since whatever is written should never get read, it actually doesn't > matter. Is there a way to tell TSan about that? You can supply compile or run time suppression lists, but that seems like a pain to do. > On the way, another potential data race is in halloc(), where a global > 'halloc_allocator' is set the first time it is checked and is null. > [...] or alternatively force > this initial setting by calling it early in the Firefox lifecycle, before > multi-thread calls can happen. We already do this in XPCOMInit.cpp via nestegg_set_halloc_func.
Assignee: nobody → kinetik
Flags: needinfo?(kinetik)
(In reply to Matthew Gregan [:kinetik] from comment #15) > halloc is a third party library that nestegg uses, so if we change halloc > itself we should try to upstream the changes. I'll take a look at this > sometime, but it's pretty far down my queue. If we're using it right, we probably don't need to do anything. Maybe we could just ping someone at Google to have a look at this, in case they want to do something (e.g. better doc about this gotcha.) > (In reply to Gerald Squelart [:gerald] from comment #12) > > But since whatever is written should never get read, it actually doesn't > > matter. Is there a way to tell TSan about that? > > You can supply compile or run time suppression lists, but that seems like a > pain to do. I see. I'll keep this bug open, at a lower priority, in case we want to do this one day... > > On the way, another potential data race is in halloc(), where a global > > 'halloc_allocator' is set the first time it is checked and is null. > > [...] or alternatively force > > this initial setting by calling it early in the Firefox lifecycle, before > > multi-thread calls can happen. > > We already do this in XPCOMInit.cpp via nestegg_set_halloc_func. Great, thank you for checking.
Assignee: kinetik → nobody
Severity: major → normal
Priority: P2 → P5
Summary: TSan: data race writing to libnestegg's hlist_null or halloc_allocator → TSan: data race writing to libnestegg's hlist_null, harmless so it could be suppressed to avoid future confusion
Depends on: 1296988
Fixed by the removal of halloc in bug 1296988.
Assignee: nobody → kinetik
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: