TSan: data race image/src/ProgressTracker.cpp:384 SyncNotify

RESOLVED FIXED in Firefox 40

Status

Core
ImageLib
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: froydnj, Assigned: seth)

Tracking

(Blocks: 1 bug)

unspecified
mozilla40
x86_64
Linux

Firefox Tracking Flags

(firefox40 fixed)

Details

(Whiteboard: [tsan])

Attachments

(2 attachments)

(Reporter)

Description

3 years ago
Created attachment 8596070 [details]
sync-notify-race.txt

The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

Looks like we're accessing mImage across threads without locks.

There's a second race included in the log where TSan detects that we're racing on the vtable of RasterImage, which is probably its own brand of fun and crashes.

There's also a host of other races following after these two in the TSan log in calls from SyncNotify; I'm not going to file races for those, since I suspect they'd all have the same root cause as this one.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even then, we should try to come up with a fix unless doing so would cause unacceptable performance issues. Also note that seemingly benign races can still be harmful, depending on the compiler and the architecture [1][2].

If the race cannot be fixed, then this bug should be used to either add a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
(Assignee)

Comment 1

3 years ago
I think the two races are really the same thing.

In theory we could get away with just using an Atomic<Image*> for mImage, but I feel like we're playing a bit fast-and-loose with a weak pointer in this code. I'm just going to use a mutex instead, so we can take a strong reference atomically, which will help me sleep at night. I'm not convinced this code is hot enough that it's worth preferring an atomic.
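To illustrate the trade-off above, here is a minimal sketch of the mutex approach in standard C++. It is not the actual patch: `Image`, `Tracker`, and `ReadWhileSetting` are hypothetical stand-ins, and `std::weak_ptr`/`std::shared_ptr` are assumed as rough analogues of the weak `mImage` pointer and a strong reference. The point is only that the lock makes "read the pointer and take a strong reference" a single atomic step with respect to the writer.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an image object; not the real ImageLib class.
struct Image {
  int width = 0;
};

// Hypothetical stand-in for a tracker that only weakly references its image.
class Tracker {
 public:
  // Writer side: roughly the role SetImage plays in the discussion above.
  void SetImage(const std::shared_ptr<Image>& image) {
    std::lock_guard<std::mutex> lock(mMutex);
    mImage = image;
  }

  // Reader side: promote the weak pointer to a strong reference while
  // holding the lock, so the read and the promotion cannot interleave
  // with a concurrent SetImage.
  std::shared_ptr<Image> GetImage() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mImage.lock();
  }

 private:
  mutable std::mutex mMutex;
  std::weak_ptr<Image> mImage;  // Assumed analogue of the weak mImage.
};

// Sets the image on one thread while reading it on another.
// Returns how many reads observed a live image.
int ReadWhileSetting(Tracker& tracker, int reads) {
  auto image = std::make_shared<Image>();
  image->width = 640;
  std::thread writer([&] { tracker.SetImage(image); });
  int seen = 0;
  for (int i = 0; i < reads; ++i) {
    if (std::shared_ptr<Image> strong = tracker.GetImage()) {
      seen += (strong->width == 640) ? 1 : 0;
    }
  }
  writer.join();
  return seen;  // The image is released here, so the weak pointer expires.
}
```

How many reads see the image depends on thread scheduling, so the sketch makes no claim about that count. Once `ReadWhileSetting` returns, `GetImage()` yields a null pointer rather than a dangling one, which is the safety property the strong reference is meant to buy.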
(Assignee)

Comment 2

3 years ago
Created attachment 8596224 [details] [diff] [review]
Protect ProgressTracker::mImage with a mutex

Here's the fix.
Attachment #8596224 - Flags: review?(tnikkel)
(Assignee)

Updated

3 years ago
Assignee: nobody → seth
Status: NEW → ASSIGNED
(Assignee)

Comment 3

3 years ago
BTW, I suspect this ultimately happens because of this NotifyListener() call in imgLoader::LoadImage(), and the analogous one in imgLoader::LoadImageWithChannel():

https://dxr.mozilla.org/mozilla-central/source/image/src/imgLoader.cpp#2336

This ends up creating an AsyncNotifyRunnable to do the notification. If this is an additional load for an image which is already loading, then it's possible for AsyncNotifyRunnable::Run() (where we'll touch mImage via ProgressTracker::SyncNotify) to race with imgRequest::OnDataAvailable() (where we'll touch mImage via ProgressTracker::SetImage).
Attachment #8596224 - Flags: review?(tnikkel) → review+
(Assignee)

Comment 5

3 years ago
Thanks for the review! Try also looks good, so I think we're ready to push this.
https://hg.mozilla.org/mozilla-central/rev/5f97bf645c13
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox40: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/0c438bdbc45f

FWIW, I've seen more crashes in various talos suites on the other branches where the backout hasn't landed yet and not one has had an identical signature.
(Assignee)

Comment 10

3 years ago
I investigated this failure, and it pointed to a further _serious_ problem - in my opinion, more serious than the TSan issue. We use ProgressTrackerInit to register the image with its ProgressTracker _while the image's constructor is running_. ProgressTracker can then both internally touch the refcount on the image and hand out a reference to the image via ProgressTracker::GetImage(). This is totally unsafe, because if we take such a reference and then drop it while the constructor is still running, the image's refcount can hit zero before it's been assigned to an nsRefPtr. We can end up running its destructor before the constructor even returns. This is very bad news.

I'm going to file another bug to fix that issue. Fixing it should make the patch in this bug safe.
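The constructor hazard described in this comment can be modelled with a small sketch. Everything here is hypothetical: `RefCounted` records that the count reached zero instead of calling `delete this`, so the problem can be observed without undefined behavior, and `EagerImage`/`LazyImage` are made-up names. The "register after construction" variant is just one common way to avoid the problem; this bug does not say how bug 1159409 actually fixed it.

```cpp
#include <cassert>

// Hypothetical intrusive refcount. It records, rather than performs,
// destruction so the sketch stays well-defined.
class RefCounted {
 public:
  void AddRef() { ++mRefCnt; }
  void Release() {
    if (--mRefCnt == 0) {
      mWouldBeDestroyed = true;  // A real implementation would delete here.
    }
  }
  bool WouldBeDestroyed() const { return mWouldBeDestroyed; }

 private:
  int mRefCnt = 0;
  bool mWouldBeDestroyed = false;
};

// Hypothetical tracker that briefly takes and drops a strong reference
// when an image registers with it.
struct Tracker {
  void Register(RefCounted* image) {
    image->AddRef();
    image->Release();
  }
};

// Unsafe pattern: registers during construction, before any owner
// holds a reference to the object.
struct EagerImage : RefCounted {
  explicit EagerImage(Tracker& tracker) { tracker.Register(this); }
};

// Safer pattern (an assumption, not the actual fix): the constructor
// does nothing, and the owner registers the image afterwards.
struct LazyImage : RefCounted {
  void Init(Tracker& tracker) { tracker.Register(this); }
};

// Returns true if the refcount hit zero during construction.
bool EagerRegistrationHitsZero() {
  Tracker tracker;
  EagerImage image(tracker);
  return image.WouldBeDestroyed();
}

// Returns true if the refcount hit zero while the owner still held it.
bool LazyRegistrationHitsZero() {
  Tracker tracker;
  LazyImage image;
  image.AddRef();  // The owner's reference, as if assigned to a smart pointer.
  image.Init(tracker);
  bool hitZero = image.WouldBeDestroyed();
  image.Release();  // The owner drops its reference last.
  return hitZero;
}
```

In the eager case the temporary reference is the only one, so dropping it takes the count to zero inside the constructor. In the lazy case the owner's reference keeps the count above zero until the owner lets go.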
(Assignee)

Updated

3 years ago
Depends on: 1159409
(Assignee)

Comment 15

3 years ago
OK, looks like tp5o is nice and green in those tests, so I think this should be safe to reland now that bug 1159409 has landed.
https://hg.mozilla.org/mozilla-central/rev/33d085b0ca40
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
status-firefox40: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla40