Closed Bug 1157343 Opened 5 years ago Closed 5 years ago
TSan: data race image/src/ProgressTracker.cpp:384 SyncNotify
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

Looks like we're accessing mImage across threads without locks. There's a second race included in the log where TSan detects that we're racing on the vtable of RasterImage, which is probably its own brand of fun and crashes. There's also a host of other races following after these two in the TSan log in calls from SyncNotify; I'm not going to file races for those, since I suspect they'd all have the same root cause as this one.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture). If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

_How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
I think the two races are really the same thing. In theory we could get away with just using an Atomic<Image*> for mImage, but I feel like we're playing a bit fast-and-loose with a weak pointer in this code. I'm just going to use a mutex instead, so we can take a strong reference atomically, which will help me sleep at night. I'm not convinced this code is hot enough that it's worth preferring an atomic.
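For readers following along, here is a minimal sketch of the "mutex plus strong reference" idea described in the comment above. It is not the attached patch: it uses std::weak_ptr, std::shared_ptr and std::mutex from the standard library as stand-ins for Mozilla's own reference-counting and locking types, and the Image, Tracker and ImageWidthOrZero names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real image class.
struct Image {
  int width = 0;
};

// Hypothetical stand-in for the tracker that holds a weak pointer to its image.
class Tracker {
public:
  // Publish the image. The lock makes the write atomic with respect to readers.
  void SetImage(const std::shared_ptr<Image>& aImage) {
    std::lock_guard<std::mutex> lock(mMutex);
    mImage = aImage;
  }

  // Forget the image, e.g. when it is going away.
  void ResetImage() {
    std::lock_guard<std::mutex> lock(mMutex);
    mImage.reset();
  }

  // Take a strong reference while holding the lock. The caller either gets
  // a live image that stays alive for as long as the returned pointer is
  // held, or gets nullptr.
  std::shared_ptr<Image> GetImage() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mImage.lock();
  }

private:
  mutable std::mutex mMutex;
  std::weak_ptr<Image> mImage;  // Guarded by mMutex.
};

// Example reader: never touches the weak pointer directly, only a strong
// reference obtained through GetImage().
int ImageWidthOrZero(const Tracker& aTracker) {
  std::shared_ptr<Image> strong = aTracker.GetImage();
  return strong ? strong->width : 0;
}
```

The point of the design is that a reader never dereferences the weak pointer itself. A plain atomic pointer would make the load race-free, but it would not by itself keep the pointee alive while the reader uses it.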
Here's the fix.
Attachment #8596224 - Flags: review?(tnikkel)
Assignee: nobody → seth
Status: NEW → ASSIGNED
BTW, I think it's worth noting that I suspect the reason this happens is ultimately because of this NotifyListener() call in imgLoader::LoadImage(), and the analogous one in imgLoader::LoadImageWithChannel(): https://dxr.mozilla.org/mozilla-central/source/image/src/imgLoader.cpp#2336 This ends up creating an AsyncNotifyRunnable to do the notification. If this is an additional load for an image which is already loading, then it's possible for AsyncNotifyRunnable::Run() (where we'll touch mImage via ProgressTracker::SyncNotify) to race with imgRequest::OnDataAvailable() (where we'll touch mImage via ProgressTracker::SetImage).
Attachment #8596224 - Flags: review?(tnikkel) → review+
Thanks for the review! Try also looks good, so I think we're ready to push this.
Backed out for various tp crashes. https://hg.mozilla.org/integration/mozilla-inbound/rev/0c438bdbc45f https://treeherder.mozilla.org/logviewer.html#?job_id=9262137&repo=mozilla-inbound https://treeherder.mozilla.org/logviewer.html#?job_id=9268382&repo=mozilla-inbound https://treeherder.mozilla.org/logviewer.html#?job_id=9260538&repo=mozilla-inbound
Merge of backout: https://hg.mozilla.org/mozilla-central/rev/0c438bdbc45f FWIW, I've seen more crashes in various talos suites on the other branches where the backout hasn't landed yet and not one has had an identical signature.
I investigated this failure, and it pointed to a further _serious_ problem - in my opinion, more serious than the TSan issue. We use ProgressTrackerInit to register the image with its ProgressTracker _while the image's constructor is running_. ProgressTracker can then both internally touch the refcount on the image and hand out a reference to the image via ProgressTracker::GetImage(). This is totally unsafe, because if we take such a reference and then drop it while the constructor is still running, the image's refcount can hit zero before it's been assigned to an nsRefPtr. We can end up running its destructor before the constructor even returns. This is very bad news. I'm going to file another bug to fix that issue. Fixing it should make the patch in this bug safe.
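The hazard described in the comment above can be modelled in a few lines. This is a simplified sketch with hypothetical names (RefCounted, Tracker, UnsafeImage, SafeImage), not the Gecko code: Release() only reports that the count reached zero instead of calling `delete this`, so the example stays well-defined. The "safe" ordering shown is one possible arrangement and is an assumption, not a description of what bug 1159409 actually changed.

```cpp
#include <cassert>

// Hypothetical intrusively refcounted base. A real implementation would
// `delete this` when the count reaches zero; here Release() only reports it.
class RefCounted {
public:
  void AddRef() { ++mRefCnt; }
  bool Release() { return --mRefCnt == 0; }  // true: object would be destroyed
  int RefCount() const { return mRefCnt; }

private:
  int mRefCnt = 0;
};

// Hypothetical tracker that briefly takes and drops a strong reference
// while registering the image.
struct Tracker {
  bool wouldHaveDestroyed = false;

  void SetImage(RefCounted* aImage) {
    aImage->AddRef();         // count goes up by one
    if (aImage->Release()) {  // count goes back down
      wouldHaveDestroyed = true;
    }
  }
};

// Unsafe pattern: registers with the tracker from inside the constructor,
// before any owner holds a reference, so the count goes 0 -> 1 -> 0.
struct UnsafeImage : RefCounted {
  explicit UnsafeImage(Tracker& aTracker) { aTracker.SetImage(this); }
};

// Safer pattern (an assumption): construct first, let the owner take its
// reference, and only then register with the tracker.
struct SafeImage : RefCounted {};

bool CreateUnsafe() {
  Tracker tracker;
  UnsafeImage image(tracker);
  return tracker.wouldHaveDestroyed;
}

bool CreateSafe() {
  Tracker tracker;
  SafeImage image;
  image.AddRef();            // owner's reference: 0 -> 1
  tracker.SetImage(&image);  // 1 -> 2 -> 1, never reaches zero
  bool destroyed = tracker.wouldHaveDestroyed;
  image.Release();           // owner drops its reference
  return destroyed;
}
```

In this model CreateUnsafe() reports that the object would have been destroyed in the middle of its own constructor, while CreateSafe() does not, because the owner's reference keeps the count above zero during registration.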
OK, looks like tp5o is nice and green in those tests, so I think this should be safe to reland now that bug 1159409 has landed.