Closed Bug 1674776 Opened 5 years ago Closed 4 years ago

ThreadSanitizer: data race [@ DocumentTimeline::GetCurrentTimeAsDuration]

Categories

(Core :: DOM: Animation, defect)

defect

Tracking

()

RESOLVED FIXED
100 Branch
Tracking Status
firefox100 --- fixed

People

(Reporter: Gankra, Assigned: hiro)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

I'm seeing this in my WIP branch to properly instrument the rust stdlib.

Looks like mLastRefreshDriverTime just needs to be atomic.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

Assignee: nobody → a.beingessner

The severity field is not set for this bug.
:boris, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(boris.chiou)

Good catch. Thanks for taking this. Mark S2 for now.

Severity: -- → S2
Flags: needinfo?(boris.chiou)

This is the naive solution. Not sure if the two accesses should be tweaked to be a single hold of
the lock to enforce mutual exclusion. TimeStamps are a bit too special for me to want to wrap
them in an atomic.

Note that we can probably use mLastRefreshDriverTime directly in
DocumentTimeline::GetCurrentTimeStamp(), i.e. we don't need to use the refresh
driver there, but I'd preserve the current behavior.

Stealing. Gankra, can you please abandon your patch? I couldn't find the way on my end. :) Thanks!

Assignee: a.beingessner → hikezoe.birchill
Status: NEW → ASSIGNED
Flags: needinfo?(a.beingessner)
Attachment #9188954 - Attachment is obsolete: true

Thanks a bunch, hiro!

Flags: needinfo?(a.beingessner)
Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/847bca011f0f Update DocumentTimeline::mLastRefreshDriverTime outside parallel styling. r=boris,decoder
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
Regressions: 1682472

Backed out for causing bug 1682472

Push with failures https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception&revision=4ce960f145fc49d5573a062fac57b23bcd4e424f&selectedTaskRun=HNlRCE0ERVOMCvrDRaQFNQ.0&searchStr=Android%2C7.0%2Cx86-64%2CWebRender%2Cdebug%2CMochitests%2Ctest-android-em-7.0-x86_64-qr%2Fdebug-geckoview-mochitest-plain-e10s%2C2

Failure log: https://treeherder.mozilla.org/logviewer?job_id=324665581&repo=autoland&lineNumber=1832

[task 2020-12-16T10:35:51.086Z] 10:35:51 INFO - remoteautomation.py | Application pid: 2986
[task 2020-12-16T10:35:54.349Z] 10:35:54 INFO - 0 INFO SimpleTest START
[task 2020-12-16T10:35:54.349Z] 10:35:54 INFO - 1 INFO TEST-START | dom/animation/test/document-timeline/test_document-timeline.html
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - Buffered messages logged at 10:35:54
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - 2 INFO TEST-PASS | dom/animation/test/document-timeline/test_document-timeline.html | Web Animations API: DocumentTimeline tests - Web Animations API: DocumentTimeline tests: Elided 1 passes or known failures.
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - Buffered messages finished
[task 2020-12-16T10:36:04.446Z] 10:36:04 WARNING - 3 INFO TEST-UNEXPECTED-FAIL | dom/animation/test/document-timeline/test_document-timeline.html | document.timeline.currentTime value tests - document.timeline.currentTime value tests: assert_greater_than_equal: document.timeline.currentTime is positive or zero expected a number greater than or equal to 0 but got -1007
[task 2020-12-16T10:36:04.446Z] 10:36:04 INFO - 4 INFO TEST-PASS | dom/animation/test/document-timeline/test_document-timeline.html | Web Animations API: DocumentTimeline tests - Web Animations API: DocumentTimeline tests: Elided 4 passes or known failures.
[task 2020-12-16T10:36:04.447Z] 10:36:04 INFO - 5 INFO TEST-OK | dom/animation/test/document-timeline/test_document-timeline.html | took 1433ms

Status: RESOLVED → REOPENED
Flags: needinfo?(hikezoe.birchill)
Resolution: FIXED → ---
Target Milestone: 86 Branch → ---
Backout by apavel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4ca1bc82b6cb Backed out changeset 847bca011f0f for causing bug 1682472 on a CLOSED TREE

Why does it fail so frequent on our CI? I haven't been able to reproduce the failure locally, on Linux with verify mode, on Android emulator with --run-until-failure --repeat=10000 so far either.

Not sure, how are you enabling tsan locally?

Oops, this is the right one.

tsan is pretty fragile, to say the least.

What I meant is the failure on Android debug WebRender in comment 9, not TSAN things.

It turned out the EffectCompositor::PreTraverseInSubtree is not a good place to call UpdateLastRefreshDriverTime() since it's only called when there is at least one animation. We should call UpdateLastRefreshDriverTime() in DocumentTimeline::WillRefresh() which is invoked every refresh driver's tick.

Flags: needinfo?(hikezoe.birchill)
Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5d120a154a6b Update DocumentTimeline::mLastRefreshDriverTime outside parallel styling. r=boris,decoder

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:hiro, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(hikezoe.birchill)

Yeah, this is still on my plate, but I noticed that the way I did is not good, it actually causes intermittent test failures that the tests query document timeline current time at very beginning of document load (before onload event is fired).

So what we should do instead is to return 0 current timeline time if we don't yet have the nsDOMNavigationTiming for the target document.

Flags: needinfo?(hikezoe.birchill)
QA Whiteboard: qa-not-actionable

hi hiro -- looks like this has been quiet a while. Does this still feel like S2-level severity? (not sure how serious/severe the data race is in this case)

Flags: needinfo?(hikezoe.birchill)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #20)

Yeah, this is still on my plate, but I noticed that the way I did is not good, it actually causes intermittent test failures that the tests query document timeline current time at very beginning of document load (before onload event is fired).

So the failed test I meant is probably dom/animation/test/document-timeline/test_document-timeline.html. Here is a link to a new try run where the test failed; https://treeherder.mozilla.org/jobs?repo=try&revision=b20a6f10f3b47e1a921fcf9c5b8a6de6c2287d5a&selectedTaskRun=J8YK2y7fSU-D95E9gSxkpw.0

For the record the failure message is;

TEST-UNEXPECTED-FAIL | dom/animation/test/document-timeline/test_document-timeline.html | document.timeline.currentTime value tests - document.timeline.currentTime value tests: assert_greater_than_equal: document.timeline.currentTime is positive or zero expected a number greater than or equal to 0 but got -998.8

And

So what we should do instead is to return 0 current timeline time if we don't yet have the nsDOMNavigationTiming for the target document.

it turns out using 0 for the current timeline time for the case where nsDOMNavigationTiming hasn't been set is not good idea since it will break our PendingAnimationTracker at all.

A missing part in D97823 was that when the nsDOMNavigationTiming is set, we don't update DocumentTimeline::mLastRefreshDriverTime, with updating DocumentTimeline::mLastRefreshDriverTime (with calling DocumentTimeline::UpdateLastRefreshDriverTime) in Document::SetNavigationTiming the test in question works fine. I will ask boris to re-review the change.

Flags: needinfo?(hikezoe.birchill)

(In reply to Daniel Holbert [:dholbert] from comment #21)

hi hiro -- looks like this has been quiet a while. Does this still feel like S2-level severity? (not sure how serious/severe the data race is in this case)

I didn't answer to this in the previous comment. :) Yes I believe this is still high severity, this is a race in parallel styling, in the stylo development we did a bunch of efforts to eliminate mutating something during parallel styling without locking, it was an important factor of stylo. we should eliminate it.

Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3f6ad6b0d556 Update DocumentTimeline::mLastRefreshDriverTime outside parallel styling. r=boris,decoder
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 100 Branch
Regressions: 1761900
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: