ThreadSanitizer: data race [@ DocumentTimeline::GetCurrentTimeAsDuration]
Categories
(Core :: DOM: Animation, defect)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox100 | --- | fixed |
People
(Reporter: Gankra, Assigned: hiro)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 1 obsolete file)
I'm seeing this in my WIP branch to properly instrument the rust stdlib.
Looks like mLastRefreshDriverTime just needs to be atomic.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
| Reporter | ||
Updated•5 years ago
|
Comment 1•5 years ago
|
||
The severity field is not set for this bug.
:boris, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 2•5 years ago
|
||
Good catch. Thanks for taking this. Mark S2 for now.
| Reporter | ||
Comment 3•5 years ago
|
||
This is the naive solution. Not sure if the two accesses should be tweaked to be a single hold of
the lock to enforce mutual exclusion. TimeStamps are a bit too special for me to want to wrap
them in an atomic.
| Assignee | ||
Comment 4•5 years ago
|
||
Note that we can probably use mLastRefreshDriverTime directly in
DocumentTimeline::GetCurrentTimeStamp(), i.e. we don't need to use the refresh
driver there, but I'd preserve the current behavior.
| Assignee | ||
Comment 5•5 years ago
|
||
Stealing. Gankra, can you please abandon your patch? I couldn't find the way on my end. :) Thanks!
Updated•5 years ago
|
Comment 8•5 years ago
|
||
| bugherder | ||
Comment 9•5 years ago
|
||
Backed out for causing bug 1682472
Failure log: https://treeherder.mozilla.org/logviewer?job_id=324665581&repo=autoland&lineNumber=1832
[task 2020-12-16T10:35:51.086Z] 10:35:51 INFO - remoteautomation.py | Application pid: 2986
[task 2020-12-16T10:35:54.349Z] 10:35:54 INFO - 0 INFO SimpleTest START
[task 2020-12-16T10:35:54.349Z] 10:35:54 INFO - 1 INFO TEST-START | dom/animation/test/document-timeline/test_document-timeline.html
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - Buffered messages logged at 10:35:54
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - 2 INFO TEST-PASS | dom/animation/test/document-timeline/test_document-timeline.html | Web Animations API: DocumentTimeline tests - Web Animations API: DocumentTimeline tests: Elided 1 passes or known failures.
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - Buffered messages finished
[task 2020-12-16T10:36:04.446Z] 10:36:04 WARNING - 3 INFO TEST-UNEXPECTED-FAIL | dom/animation/test/document-timeline/test_document-timeline.html | document.timeline.currentTime value tests - document.timeline.currentTime value tests: assert_greater_than_equal: document.timeline.currentTime is positive or zero expected a number greater than or equal to 0 but got -1007
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - 4 INFO TEST-PASS | dom/animation/test/document-timeline/test_document-timeline.html | Web Animations API: DocumentTimeline tests - Web Animations API: DocumentTimeline tests: Elided 4 passes or known failures.
[task 2020-12-16T10:36:04.447Z] 10:36:04 INFO - 5 INFO TEST-OK | dom/animation/test/document-timeline/test_document-timeline.html | took 1433ms
Comment 10•5 years ago
|
||
Comment 11•5 years ago
|
||
Backout merged
https://hg.mozilla.org/mozilla-central/rev/4ca1bc82b6cb
| Assignee | ||
Comment 12•5 years ago
•
|
||
Why does it fail so frequent on our CI? I haven't been able to reproduce the failure locally, on Linux with verify mode, on Android emulator with --run-until-failure --repeat=10000 so far either.
| Reporter | ||
Comment 13•5 years ago
|
||
Not sure, how are you enabling tsan locally?
| Reporter | ||
Comment 14•5 years ago
|
||
Oops, this is the right one.
tsan is pretty fragile, to say the least.
| Assignee | ||
Comment 15•5 years ago
|
||
What I meant is the failure on Android debug WebRender in comment 9, not TSAN things.
| Assignee | ||
Comment 16•5 years ago
|
||
It turned out the EffectCompositor::PreTraverseInSubtree is not a good place to call UpdateLastRefreshDriverTime() since it's only called when there is at least one animation. We should call UpdateLastRefreshDriverTime() in DocumentTimeline::WillRefresh() which is invoked every refresh driver's tick.
Comment 17•5 years ago
|
||
Comment 18•5 years ago
|
||
Backed out as requested by Hiro
Backout link : https://hg.mozilla.org/integration/autoland/rev/4d881be6bcbe4055f56429d4605d9f5da528d2fd
Comment 19•5 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:hiro, could you have a look please?
For more information, please visit auto_nag documentation.
| Assignee | ||
Comment 20•5 years ago
|
||
Yeah, this is still on my plate, but I noticed that the way I did is not good, it actually causes intermittent test failures that the tests query document timeline current time at very beginning of document load (before onload event is fired).
So what we should do instead is to return 0 current timeline time if we don't yet have the nsDOMNavigationTiming for the target document.
Comment 21•4 years ago
•
|
||
hi hiro -- looks like this has been quiet a while. Does this still feel like S2-level severity? (not sure how serious/severe the data race is in this case)
| Assignee | ||
Comment 22•4 years ago
|
||
(In reply to Hiroyuki Ikezoe (:hiro) from comment #20)
Yeah, this is still on my plate, but I noticed that the way I did is not good, it actually causes intermittent test failures that the tests query document timeline current time at very beginning of document load (before onload event is fired).
So the failed test I meant is probably dom/animation/test/document-timeline/test_document-timeline.html. Here is a link to a new try run where the test failed; https://treeherder.mozilla.org/jobs?repo=try&revision=b20a6f10f3b47e1a921fcf9c5b8a6de6c2287d5a&selectedTaskRun=J8YK2y7fSU-D95E9gSxkpw.0
For the record the failure message is;
TEST-UNEXPECTED-FAIL | dom/animation/test/document-timeline/test_document-timeline.html | document.timeline.currentTime value tests - document.timeline.currentTime value tests: assert_greater_than_equal: document.timeline.currentTime is positive or zero expected a number greater than or equal to 0 but got -998.8
And
So what we should do instead is to return 0 current timeline time if we don't yet have the nsDOMNavigationTiming for the target document.
it turns out using 0 for the current timeline time for the case where nsDOMNavigationTiming hasn't been set is not good idea since it will break our PendingAnimationTracker at all.
A missing part in D97823 was that when the nsDOMNavigationTiming is set, we don't update DocumentTimeline::mLastRefreshDriverTime, with updating DocumentTimeline::mLastRefreshDriverTime (with calling DocumentTimeline::UpdateLastRefreshDriverTime) in Document::SetNavigationTiming the test in question works fine. I will ask boris to re-review the change.
| Assignee | ||
Comment 23•4 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #21)
hi hiro -- looks like this has been quiet a while. Does this still feel like S2-level severity? (not sure how serious/severe the data race is in this case)
I didn't answer to this in the previous comment. :) Yes I believe this is still high severity, this is a race in parallel styling, in the stylo development we did a bunch of efforts to eliminate mutating something during parallel styling without locking, it was an important factor of stylo. we should eliminate it.
Comment 24•4 years ago
|
||
Comment 25•4 years ago
|
||
| bugherder | ||
Description
•