Closed Bug 1601940 Opened 4 months ago Closed 1 month ago

ThreadSanitizer: data race [@ mozilla::layers::AsyncPanZoomController::ApplyAsyncTestAttributes] vs. [@ MarkAsyncTransformAppliedToContent]

Categories

(Core :: Panning and Zooming, defect, P3, critical)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla75
Tracking Status
firefox73 --- wontfix
firefox74 --- wontfix
firefox75 --- fixed

People

(Reporter: decoder, Assigned: botond)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 6989fcd6bab3.

This race looks a little weird to me at first:

MarkAsyncTransformAppliedToContent is setting a bool field to true here:

https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/gfx/layers/apz/src/AsyncPanZoomController.h#1605

and AsyncPanZoomController::ApplyAsyncTestAttributes is reading and writing to mTestAttributeAppliers at the same time here:

https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/gfx/layers/apz/src/AsyncPanZoomController.cpp#4218
https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/gfx/layers/apz/src/AsyncPanZoomController.cpp#4226

(I'm seeing race reports both for read/write and write/write) with these two locations).

These fields are right next to each other:

https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/gfx/layers/apz/src/AsyncPanZoomController.h#1623-1626

and TSan is reporting the write at the exact same location. Could this be due to alignment of these two fields? I don't know what the consequences would be in that case.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

This is probably the compiler optimizing the read/write to those fields into larger reads/writes (maybe 64bit aligned) because they are both declared as bitfields. Nathan confirmed that this is legit for the compiler to do. This could have the side-effect that one of the two fields is overwritten with the wrong value when the other is updated and both writes are racing (in the double-write case).

Nathan suggests to use int8_t instead of int : 8 to avoid this. I will give that a try (push).

Ha! I've been seeing this assertion fail intermittently ever since I've added the ApplyAsyncTestAttributes() code, and I could not for the life of me figure out why. Your theory about the compiler merging memory writes finally provides an explanation for this :)

(In reply to Botond Ballo [:botond] from comment #3)

Your theory about the compiler merging memory writes finally provides an explanation for this :)

Fwiw, I wrote a simple CPP example involving two racing pthreads and a class with similar bitfields and could easily confirm the behavior. In my example, the compiler seems to be using 64-bit reads/writes.

The priority flag is not set for this bug.
:jbonisteel, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jbonisteel)
Component: Graphics → Panning and Zooming
Flags: needinfo?(jbonisteel)
Priority: -- → P3

(In reply to Christian Holler (:decoder) from comment #2)

Nathan suggests to use int8_t instead of int : 8 to avoid this. I will give that a try (push).

Did you end up giving this a try? With this diagnosis in hand, we should be able to fix this easily. I can take over the bug if desired.

Flags: needinfo?(choller)

(In reply to Botond Ballo [:botond] from comment #6)

(In reply to Christian Holler (:decoder) from comment #2)

Nathan suggests to use int8_t instead of int : 8 to avoid this. I will give that a try (push).

Did you end up giving this a try? With this diagnosis in hand, we should be able to fix this easily. I can take over the bug if desired.

Yes, this fix works, but will increase memory consumption because you will consume a full byte per flag instead. If that's not a problem in your use case, then this is an easy fix. The alternative would be locking for writing these fields.

Flags: needinfo?(choller)

(In reply to Christian Holler (:decoder) from comment #7)

Yes, this fix works, but will increase memory consumption because you will consume a full byte per flag instead. If that's not a problem in your use case, then this is an easy fix.

Yep, taking the full byte is fine in this case. The object that this is a field of (AsyncPanZoomController) has relatively few instances of it existing (one per active scrollable element on a page), and has many other fields which are larger. Making this field a bitfield was just a case of premature optimization (which is the root of all evil, I know...)

Duplicate of this bug: 1572150
Assignee: nobody → botond

A write to a bitfield can legally be optimized to a write of the entire
machine word containing it, which makes ensuring thread safety more
complicated (we'd have to make sure the entire word forms a critical section
such that one bitfield in it can't be read while another is written).

Thread safety in APZ is hard enough as it, and AsyncPanZoomController has
a large sizeof() already, we can just take up a byte for each field.

Depends on D64485

Pushed by bballo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9aaf2c037d8f
Don't use bitfield members in AsyncPanZoomController. r=kats
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

You need to log in before you can comment on or make changes to this bug.