Open Bug 1623541 Opened 4 years ago Updated 4 months ago

ThreadSanitizer: data race [@ mozilla::gfx::VRShMem::PullSystemState] vs. [@ mozilla::gfx::VRShMem::PushSystemState]

Categories

(Core :: WebVR, defect, P2)

x86_64
Linux
defect

Tracking

()

REOPENED

People

(Reporter: jkratzer, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: leave-open, stalled, testcase)

Attachments

(5 files)

The attached crash information was detected while fuzzing with ThreadSanitizer on mozilla-central revision 20200318-70f8ce3e2d39.

For detailed crash information, see attachment.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

Attached file testcase.html
Group: core-security → gfx-core-security
Blocks: domino

The priority flag is not set for this bug.
:kip, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kgilbert)
Assignee: nobody → kgilbert
Flags: needinfo?(kgilbert)

The mozilla::gfx::VRShMem::PullSystemState and mozilla::gfx::VRShMem::PushSystemState methods implement a lock-free, queue-free IPC used for hard-realtime communication of VR hardware sensor inputs. They include dual generation-id checks for dirty copies, which are then discarded in the case of a data race.

For non-x86 platforms (eg, aarch64 for Android), a mutex is used in place of the lock-free data structure due to the non-guarantee of memory write order and cache coherent operation.

Would it be appropriate to flag this as an exception?

Flags: needinfo?(jkratzer)

(In reply to :kip (Kearwood Gilbert) from comment #4)

The mozilla::gfx::VRShMem::PullSystemState and mozilla::gfx::VRShMem::PushSystemState methods implement a lock-free, queue-free IPC used for hard-realtime communication of VR hardware sensor inputs. They include dual generation-id checks for dirty copies, which are then discarded in the case of a data race.

For non-x86 platforms (eg, aarch64 for Android), a mutex is used in place of the lock-free data structure due to the non-guarantee of memory write order and cache coherent operation.

Would it be appropriate to flag this as an exception?

We can add this as an exception to the suppression list. That said, I am not sure if you have any guarantees on x86 either. Your approach seems to rely on undefined behavior and the compiler can assume the race cannot happen and reorder things appropriately.

We can ask Nathan about guarantees we might (not) have on x86 when he is back from PTO :)

In the meantime, can you craft a patch that adds VRShMem::PullSystemState and VRShMem::PushSystemState to mozglue/build/TsanOptions.cpp and flag me for review? Thanks!

Flags: needinfo?(jkratzer) → needinfo?(kgilbert)
Flags: needinfo?(kgilbert)
Priority: -- → P2

I'm hitting this on mochitest-browser-chrome and will add a suppression now.

(In reply to :kip (Kearwood Gilbert) from comment #4)

For non-x86 platforms (eg, aarch64 for Android), a mutex is used in place of the lock-free data structure due to the non-guarantee of memory write order and cache coherent operation.

If this doesn't get a proper fix on x86, can we instead enable the mutex also for TSan builds? This should fix the issue from TSan's perspective.

However, I still recommend someone should review the assumptions you are making on x86. The data race here is undefined behavior and I don't think the outlined guarantees exist if the compiler starts to optimize the code accordingly.

(In reply to :kip (Kearwood Gilbert) from comment #4)

The mozilla::gfx::VRShMem::PullSystemState and mozilla::gfx::VRShMem::PushSystemState methods implement a lock-free, queue-free IPC used for hard-realtime communication of VR hardware sensor inputs. They include dual generation-id checks for dirty copies, which are then discarded in the case of a data race.

For non-x86 platforms (eg, aarch64 for Android), a mutex is used in place of the lock-free data structure due to the non-guarantee of memory write order and cache coherent operation.

Alexis, does this sound like the crossbeam approach you've mentioned the other day? Do you know if the guarantees used here are indeed guaranteed on x86? If so, I assume the best fix for this is to use the mutex for TSan builds just like it is used on Android?

Flags: needinfo?(a.beingessner)

I believe the issues crossbeam were twofold:

The chase-lev deque algorithm is inherently data-racy. The steal operation first reads data from the buffer and then checks whether something changed in the meantime. If so, the read data is discarded (using mem::forget) and the steal operation has to be retried. Put differently, a data race might happen, but we never follow up on it so UB doesn't happen.

The epoch-based memory reclamation fundamentally requires fences to synchronize epochs (generations).

I'm not sure what they're describing is either of these things, without a name for the technique.

Flags: needinfo?(a.beingessner)

I just hit a similar-looking trace while bringing up wpt under tsan -- does that look like it's the same underlying issue?

The leave-open keyword is there and there is no activity for 6 months.
:jimm, maybe it's time to close this bug?

Flags: needinfo?(jmathies)

This looks like it's still an active issue, just stalled since webvr isn't currently maintained.

Flags: needinfo?(jmathies)
Keywords: stalled
Assignee: kearwood → nobody
Severity: normal → S3

As WebVR isn't maintained and we're trying to close out stalled bugs I think we can mark this wontfix

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WONTFIX

TSan bugs with an active suppression in the code base should never be closed unless the race is fixed or the offending code removed. Doing so can cause confusion with the respective automation. However, as this report is really old and WebVR is disabled/unmaintained, we can open it up. There is also no evidence it has any security impact.

Group: gfx-core-security
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: