Closed Bug 1157343 Opened 5 years ago Closed 5 years ago

TSan: data race image/src/ProgressTracker.cpp:384 SyncNotify

Categories

(Core :: ImageLib, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox40 --- fixed

People

(Reporter: froydnj, Assigned: seth)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tsan])

Attachments

(2 files)

Attached file sync-notify-race.txt
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

Looks like we're accessing mImage across threads without locks.

There's a second race included in the log where TSan detects that we're racing on the vtable of RasterImage, which is probably its own brand of fun and crashes.

There's also a host of other races following after these two in the TSan log in calls from SyncNotify; I'm not going to file races for those, since I suspect they'd all have the same root cause as this one.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even then, we should try to come up with a fix unless it would cause unacceptable performance issues. Also note that seemingly benign races can be harmful, depending on the compiler and the architecture [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
I think the two races are really the same thing.

In theory we could get away with just using an Atomic<Image*> for mImage, but I feel like we're playing a bit fast-and-loose with a weak pointer in this code. I'm just going to use a mutex instead, so we can take a strong reference atomically, which will help me sleep at night. I'm not convinced this code is hot enough that it's worth preferring an atomic.
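To illustrate the trade-off described above, here is a minimal sketch in standard C++, not the attached patch. It assumes `std::weak_ptr` and `std::shared_ptr` as stand-ins for the raw weak `mImage` pointer and `nsRefPtr`; the `Tracker` class and the `StrongRefOutlivesOwner()` helper are hypothetical names. The point is that with a bare atomic pointer, loading the pointer and taking a reference are two separate steps, whereas holding a mutex makes "read the weak pointer and take a strong reference" a single critical section.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Stand-in for the image type; the real class is far richer.
struct Image {
  int width = 0;
};

// Hypothetical tracker holding a weak reference guarded by a mutex.
class Tracker {
public:
  void SetImage(const std::shared_ptr<Image>& aImage) {
    std::lock_guard<std::mutex> lock(mMutex);
    mImage = aImage;
  }

  void ResetImage() {
    std::lock_guard<std::mutex> lock(mMutex);
    mImage.reset();
  }

  // Reads the weak pointer and takes a strong reference under one lock,
  // so a concurrent SetImage()/ResetImage() cannot interleave with it.
  std::shared_ptr<Image> GetImage() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mImage.lock();
  }

private:
  mutable std::mutex mMutex;
  std::weak_ptr<Image> mImage;  // weak: does not keep the image alive
};

// Hypothetical single-threaded demo: a strong reference obtained from the
// tracker keeps the image alive even after the original owner lets go.
bool StrongRefOutlivesOwner() {
  Tracker tracker;
  auto owner = std::make_shared<Image>();
  tracker.SetImage(owner);

  std::shared_ptr<Image> strong = tracker.GetImage();
  owner.reset();  // the original owner drops its reference
  const bool alive = strong != nullptr && strong.use_count() == 1;

  tracker.ResetImage();
  return alive && tracker.GetImage() == nullptr;
}
```

The accessors are intended to be callable from any thread; the demo stays single-threaded only so that its result is deterministic.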
Here's the fix.
Attachment #8596224 - Flags: review?(tnikkel)
Assignee: nobody → seth
Status: NEW → ASSIGNED
BTW, I suspect this ultimately happens because of this NotifyListener() call in imgLoader::LoadImage(), and the analogous one in imgLoader::LoadImageWithChannel():

https://dxr.mozilla.org/mozilla-central/source/image/src/imgLoader.cpp#2336

This ends up creating an AsyncNotifyRunnable to do the notification. If this is an additional load for an image which is already loading, then it's possible for AsyncNotifyRunnable::Run() (where we'll touch mImage via ProgressTracker::SyncNotify) to race with imgRequest::OnDataAvailable() (where we'll touch mImage via ProgressTracker::SetImage).
Attachment #8596224 - Flags: review?(tnikkel) → review+
Thanks for the review! Try also looks good, so I think we're ready to push this.
https://hg.mozilla.org/mozilla-central/rev/5f97bf645c13
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/0c438bdbc45f

FWIW, I've seen more crashes in various talos suites on the other branches where the backout hasn't landed yet and not one has had an identical signature.
I investigated this failure, and it pointed to a further _serious_ problem - in my opinion, more serious than the TSan issue. We use ProgressTrackerInit to register the image with its ProgressTracker _while the image's constructor is running_. ProgressTracker can then both internally touch the refcount on the image and hand out a reference to the image via ProgressTracker::GetImage(). This is totally unsafe, because if we take such a reference and then drop it while the constructor is still running, the image's refcount can hit zero before it's been assigned to an nsRefPtr. We can end up running its destructor before the constructor even returns. This is very bad news.

I'm going to file another bug to fix that issue. Fixing it should make the patch in this bug safe.
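The hazard described above can be sketched with a toy intrusive refcount. This is an illustration under stated assumptions, not the change made in bug 1159409: `RefCounted`, `RefPtr`, `Tracker`, `Image::Create()` and `LiveImagesAfterTemporaryRef()` are all hypothetical stand-ins, and the refcount is non-atomic because the sketch is single-threaded. It shows one common way to avoid the problem: take an owning reference first, and only then publish the object to the tracker.

```cpp
#include <cassert>

// Toy intrusive refcount (non-atomic; single-threaded sketch only).
class RefCounted {
public:
  void AddRef() { ++mRefCnt; }
  void Release() {
    if (--mRefCnt == 0) {
      delete this;
    }
  }
protected:
  virtual ~RefCounted() = default;
private:
  int mRefCnt = 0;  // starts at zero until the first owner appears
};

// Tiny stand-in for a strong smart pointer such as nsRefPtr.
template <typename T>
class RefPtr {
public:
  explicit RefPtr(T* aPtr = nullptr) : mPtr(aPtr) { if (mPtr) mPtr->AddRef(); }
  RefPtr(const RefPtr& aOther) : mPtr(aOther.mPtr) { if (mPtr) mPtr->AddRef(); }
  RefPtr& operator=(const RefPtr&) = delete;
  ~RefPtr() { if (mPtr) mPtr->Release(); }
  T* get() const { return mPtr; }
private:
  T* mPtr;
};

class Image;

class Tracker {
public:
  void SetImage(Image* aImage) { mImage = aImage; }
  RefPtr<Image> GetImage() const;  // hands out a strong reference
private:
  Image* mImage = nullptr;  // weak pointer
};

class Image : public RefCounted {
public:
  // Safe ordering: own the image (refcount 0 -> 1) before publishing it.
  // If the constructor itself called aTracker.SetImage(this), a GetImage()
  // result dropped before the constructor returned would take the refcount
  // 0 -> 1 -> 0 and run `delete this` on a half-constructed object.
  static RefPtr<Image> Create(Tracker& aTracker) {
    RefPtr<Image> image(new Image());
    aTracker.SetImage(image.get());
    return image;
  }
  static int sLiveImages;
private:
  Image() { ++sLiveImages; }
  ~Image() override { --sLiveImages; }
};
int Image::sLiveImages = 0;

RefPtr<Image> Tracker::GetImage() const { return RefPtr<Image>(mImage); }

// Hypothetical demo: a temporary reference (1 -> 2 -> 1) no longer destroys
// the image, because the creator already holds its own reference.
int LiveImagesAfterTemporaryRef() {
  Tracker tracker;
  RefPtr<Image> image = Image::Create(tracker);
  { RefPtr<Image> temp = tracker.GetImage(); }
  tracker.SetImage(nullptr);  // unpublish before the image goes away
  return Image::sLiveImages;
}
```

With this ordering, the refcount can only reach zero after the creator has released its own reference, which cannot happen while the constructor is still running.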
Depends on: 1159409
OK, looks like tp5o is nice and green in those tests, so I think this should be safe to reland now that bug 1159409 has landed.
https://hg.mozilla.org/mozilla-central/rev/33d085b0ca40
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40