Open Bug 1703727 Opened 4 years ago Updated 6 months ago

Unsound `SwCompositeGraphNodeRef` wrapper is UB and creates data races in safe code

Categories

(Core :: Graphics: WebRender, enhancement)

enhancement

Tracking

()

UNCONFIRMED

People

(Reporter: nyanpasu64, Unassigned)

References

(Blocks 1 open bug)

Details

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0

Steps to reproduce:

While reading a recent Mozilla blog post, I found a bug report about a data race. While reading the patch, I was surprised to see that the fix was switching field accesses to atomic loads/stores in safe code, even though &mut should indicate exclusive access and that writes are safe, and & should indicate no concurrent writes and that reads are safe.

I think "use SwCompositeGraphNodeRef wrapper for cleaner node mutation" is wrong, both from a theoretical point of view (according to Stacked Borrows, I think it's illegal to create a &mut unsynchronized, even if you don't use it to perform subsequent data races), and from a practical point of view (it makes data races trivial to write in safe code, even though the bugfix removed one instance of them occurring, and I don't know if there are others left to be discovered).

Actual results:

Looking at the source of sw_compositor.rs, SwCompositeGraphNodeRef has "safe" methods (line 286) with unsafe blocks, handing out raw &SwCompositeGraphNode and &mut SwCompositeGraphNode pointing within an UnsafeCell, without locking or runtime checking. And worse yet, SwCompositeGraphNodeRef has Deref and DerefMut implementations that create & and &mut, unsynchronized, implicitly performing unsound accesses from call sites that merely look like harmless method calls...

The bugfix for the data race changed safe code to perform atomic shared operations through a &mut. But safe writes/reads should not be able to cause data races, so (in my view) the unsafe code creating the & and &mut from an UnsafeCell is unsound. (It definitely would be a soundness hole as a public API, but this is module-private, so I'm not so sure. I still think it's illegal under Stacked Borrows rules.)

My theory as to how this happened:

  • C++ atomics are not mutable through a const *, so the authors used &mut to indicate mutability, even though Rust's &mut has stronger guarantees. According to the current iteration of the Stacked Borrows memory model, constructing a &mut implies that all other pointers to this object are invalidated, except the pointers the &mut reborrowed from.
  • Maybe they thought "we want to prevent calling &mut SwCompositeGraphNode methods from a &SwCompositeGraphNodeRef". The sync/send rules are odd.
  • Maybe the authors thought it was valid to both &mut from one thread, and & from multiple other threads.
  • Maybe the methods taking &mut were intended to run on 1 thread in one phase, and the methods taking & were intended to run on other threads in different phases. And the phases were never intended to run in parallel, but instead one phase ending was intended to happen-before the next phase beginning. But this was not ensured through synchronizing operations between threads.

Honestly I feel this bug is an instance of "things being randomly accessed on different threads with unclear semantics, necessitating non-trivial refactorings of the code" (which weren't properly performed in this case), despite being written in Rust, where unsafe code should be designed to localize soundness (but wasn't written to do so here).

Expected results:

I don't understand WebRender to the extent I've worked with atomics, but I have some ideas that might be good or not. Perhaps SwCompositeGraphNodeRef should be changed to hold an Arc<SwCompositeGraphNode>, and the non-atomic fields of SwCompositeGraphNode (job and children) should be moved under either individual UnsafeCell, or a single one for both fields. And all &mut self methods under impl SwCompositeGraphNode should be replaced with &self, and either add suitable synchronization before dereferencing UnsafeCell (I don't understand WebRender and didn't deeply review/debug the code, but the current/fixed code still looks vulnerable to data races on job and perhaps children) or mark the methods as unsafe and abdicate responsibility for subsequent UB.

Should I try compiling and running Firefox/WebRender myself, and perhaps contribute to it? Or did I misunderstand this issue which I think is UB? Or is this "software OpenGL shim" code considered unimportant? Or is WebRender not accepting new contributors?

Flags: needinfo?(a.beingessner)

Yeah my fixup was hasty, I was a bit suspicious of all this code but was pretty tired and y'all seemed happy with my fix. It's definitely correct that &mut implies exclusive access (~noalias) so any concurrent code should at most be using & unless their is some kind of locking/exclusivity guarantee (for a more subtle example that doesn't involve mutexes, see Arc::get_mut).

I believe the concern is mostly theoretical, in that miri will probably catch this but you're unlikely to get miscompilation today because the atomics probably force LLVM to be very conservative in codegen. That said, it should definitely be fixed, because we all now how theoretical UB has a tendency to suddenly get practical.

I don't know anything about this code, so I can't immediately recommend what the fix should be. I'm happy to consult on solutions with someone who knows this code better, though.

Flags: needinfo?(a.beingessner) → needinfo?(tnikkel)
Flags: needinfo?(tnikkel) → needinfo?(lsalzman)

Software WebRender is extremely performance sensitive, especially the compositing code, which may be some of the most performance-sensitive code in WebRender. Otherwise stated, performance concerns take priority over just about anything else here. While the code might not be "by the book", it works, ensures atomicity of accesses, and is fast.

As always, patches are welcome, with these concerns in mind.

Type: defect → enhancement
Flags: needinfo?(lsalzman)
Severity: -- → S4
Blocks: wr-todos
You need to log in before you can comment on or make changes to this bug.