ThreadSanitizer: data race [@ mozilla::layers::AsyncPanZoomController::ApplyAsyncTestAttributes] vs. [@ MarkAsyncTransformAppliedToContent]
Categories
(Core :: Panning and Zooming, defect, P3)
Tracking
()
People
(Reporter: decoder, Assigned: botond)
References
(Blocks 2 open bugs)
Details
Attachments
(2 files)
The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 6989fcd6bab3.
This race looks a little weird to me at first:
MarkAsyncTransformAppliedToContent
is setting a bool field to true
here:
and AsyncPanZoomController::ApplyAsyncTestAttributes
is reading and writing to mTestAttributeAppliers
at the same time here:
https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/gfx/layers/apz/src/AsyncPanZoomController.cpp#4218
https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/gfx/layers/apz/src/AsyncPanZoomController.cpp#4226
(I'm seeing race reports both for read/write and write/write) with these two locations).
These fields are right next to each other:
and TSan is reporting the write at the exact same location. Could this be due to alignment of these two fields? I don't know what the consequences would be in that case.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp
. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Reporter | ||
Comment 1•3 years ago
|
||
Reporter | ||
Comment 2•3 years ago
|
||
This is probably the compiler optimizing the read/write to those fields into larger reads/writes (maybe 64bit aligned) because they are both declared as bitfields. Nathan confirmed that this is legit for the compiler to do. This could have the side-effect that one of the two fields is overwritten with the wrong value when the other is updated and both writes are racing (in the double-write case).
Nathan suggests to use int8_t
instead of int : 8
to avoid this. I will give that a try (push).
Assignee | ||
Comment 3•3 years ago
|
||
Ha! I've been seeing this assertion fail intermittently ever since I've added the ApplyAsyncTestAttributes()
code, and I could not for the life of me figure out why. Your theory about the compiler merging memory writes finally provides an explanation for this :)
Reporter | ||
Comment 4•3 years ago
|
||
(In reply to Botond Ballo [:botond] from comment #3)
Your theory about the compiler merging memory writes finally provides an explanation for this :)
Fwiw, I wrote a simple CPP example involving two racing pthreads and a class with similar bitfields and could easily confirm the behavior. In my example, the compiler seems to be using 64-bit reads/writes.
Comment 5•3 years ago
|
||
The priority flag is not set for this bug.
:jbonisteel, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 6•3 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #2)
Nathan suggests to use
int8_t
instead ofint : 8
to avoid this. I will give that a try (push).
Did you end up giving this a try? With this diagnosis in hand, we should be able to fix this easily. I can take over the bug if desired.
Reporter | ||
Comment 7•3 years ago
|
||
(In reply to Botond Ballo [:botond] from comment #6)
(In reply to Christian Holler (:decoder) from comment #2)
Nathan suggests to use
int8_t
instead ofint : 8
to avoid this. I will give that a try (push).Did you end up giving this a try? With this diagnosis in hand, we should be able to fix this easily. I can take over the bug if desired.
Yes, this fix works, but will increase memory consumption because you will consume a full byte per flag instead. If that's not a problem in your use case, then this is an easy fix. The alternative would be locking for writing these fields.
Assignee | ||
Comment 8•3 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #7)
Yes, this fix works, but will increase memory consumption because you will consume a full byte per flag instead. If that's not a problem in your use case, then this is an easy fix.
Yep, taking the full byte is fine in this case. The object that this is a field of (AsyncPanZoomController
) has relatively few instances of it existing (one per active scrollable element on a page), and has many other fields which are larger. Making this field a bitfield was just a case of premature optimization (which is the root of all evil, I know...)
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 10•3 years ago
|
||
A write to a bitfield can legally be optimized to a write of the entire
machine word containing it, which makes ensuring thread safety more
complicated (we'd have to make sure the entire word forms a critical section
such that one bitfield in it can't be read while another is written).
Thread safety in APZ is hard enough as it, and AsyncPanZoomController has
a large sizeof() already, we can just take up a byte for each field.
Depends on D64485
Comment 11•3 years ago
|
||
Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9aaf2c037d8f Don't use bitfield members in AsyncPanZoomController. r=kats
Comment 12•3 years ago
|
||
bugherder |
Comment 13•3 years ago
|
||
Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.
Assignee | ||
Updated•3 years ago
|
Description
•