1529933 - ARM/ARM64: Use-after-free of previously generated code due to non-synchronized instructions caches

Reporter

Description

•

5 years ago

ARM & ARM64 have can have dirty instruction caches across multiple cores. This problem is explained in 9.2.2 of [1].

If I understand correctly this implies that if one core is writing code and invalidating its instruction cache, other cores might not yet be aware of the changes by the time they execute the code, even if data barriers are used, and therefore other cores might attempt to execute an outdated version of the code which used to be at the same address before it got rewritten.

While working on Bug 1441436, I noticed that WebAssembly is currently flushing instruction caches on its compilation thread. I still have not verified whether the instruction caches were always flushed before being executed but I presume that this might not always be the case.

[1] http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf

Nicolas B. Pierron [:nbp]

Reporter

Updated

•

5 years ago

Blocks: 1526993

Lars T Hansen [:lth]

Assignee

Comment 1

•

5 years ago

Nice find...

Indeed, wasm can create a module that is postMessage'd to another worker, and that worker may start executing it directly, but if that worker runs on a different core that has had other code at those addresses before then it'll read the wrong code. This is the case even without shared memory and atomics and has nothing to do with tiered compilation, say - it's just a basic problem.

(I'm curious what happens if the OS moves a thread from one core to the other. Presumably it must invalidate the icache of the receiving core in this case, or even single-thread programs that generate code might not work. Or the cache must be keyed in some interesting way.)

Assuming that the OS handles this sanely when the thread migrates, the easy way to fix the basic problem is probably to invalidate the icache for a module's addresses when creating an instance from the module, since the instance is strictly per-thread. As an optimization we could do this only for a module that was received from postMessage.

But for tiered compilation there's another wrinkle, since we create the tier-2 module, flush the icache for its code, and then write code addresses into the jump table for the module (this is data, not code), and executing code running tier-1 code will load from that table and jump to any non-null value. That tier-1 code could be running on any core and we only sync the data cache after updating the table (and we don't even need to do that). The code that's being jumped to may be unsynchronized.

So for tiered compilation it looks like we need an all-cores synchronization point after compilation [*] to do icache flushing on other cores before we can write anything into the jump table. Again this is in principle only for modules that have been postMessaged.

(I have to think a little longer about exactly under which conditions we'll need to flush. For example, on the one hand there's a point at which we're done patching in tier-2 code and only tier-2 code ought to run and instance creation after that point could perhaps only flush tier-2 code; on the other hand, I think that, for simplicity, some of the entry points into the module remain tier-1 code indefinitely.)

[*] If the icache sync invalidates the core's cache (as opposed to refreshing it), as I think it ought to, then it may be easier to manage the cross-core flushing when code memory is freed, not when it has been filled with new code.

Nicolas B. Pierron [:nbp]

Reporter

Comment 2

•

5 years ago

For what is worth, the latest version of Bug 1441436 on which I am working on, does not find any issues now on the test suite.

(In reply to Lars T Hansen [:lth] from comment #1)

Indeed, wasm can create a module that is postMessage'd to another worker, and that worker may start executing it directly, but if that worker runs on a different core that has had other code at those addresses before then it'll read the wrong code. This is the case even without shared memory and atomics and has nothing to do with tiered compilation, say - it's just a basic problem.

One option would be to re-call the cache flush with a nullptr and a 0 size, this would execute an extra dsb instruction (in we invert the do-while loop to skip the 0-length case) and do the synchronization required for this case.

Lars T Hansen [:lth]

Assignee

Comment 3

•

5 years ago

We had a chat about this in today's wasm meeting. We still need to verify that this is a problem, but if we find that it is, then we have some basic leads to follow:

Do ARM operating systems have system calls that flush all icaches system-wide to benefit situations like this?
Is it enough to munmap and then mremap a page that is to receive new code to make the OS flush the icaches on all cores in the process?
What do kernels do here? Julian knows people at Linaro and might ask, once we've figured out what we're asking.

Julian Seward [:jseward]

Comment 4

•

5 years ago

Lars and I had a bit of a poke around our coherence code, in particular jit/arm64/vixl/MozCpu-vixl.cpp, function CPU::EnsureIAndDCacheCoherency. This does look as if it has been written with multiprocessors in mind, in particular see the comments at lines 86-88 and 105-109 (https://searchfox.org/mozilla-central/source/js/src/jit/arm64/vixl/MozCpu-vixl.cpp#105-109):

86  // The point of unification for a processor is the point by which the
87  // instruction and data caches are guaranteed to see the same copy of a
88  // memory location. See ARM DDI 0406B page B2-12 for more information.

105  // The point of unification for an Inner Shareable shareability domain is
106  // the point by which the instruction and data caches of all the processors
107  // in that Inner Shareable shareability domain are guaranteed to see the
108  // same copy of a memory location.  See ARM DDI 0406B page B2-12 for more
109  // information.

So it looks to me as if this will correctly cause icache flushing for all cores attached to the same point of unification as the requesting core. Where "point of unification" means, presumably, the shared last-level cache.

I also looked at what I assume is the most recent VIXL sources for this, at https://github.com/armvixl/vixl/blob/master/src/vixl/a64/cpu-a64.cc, and they look to be the same as the code we have.

On arm32 I don't know what the situation is. I know that there's a Linux system call, _ARM_cacheflush, that presumably does whatever's necessary. I don't think there's any 64-bit equivalent, so it might be that 32- and 64-bit ARM have different mechanisms for flushing.

Nicolas B. Pierron [:nbp]

Reporter

Comment 5

•

5 years ago

(In reply to Julian Seward [:jseward] from comment #4)

On arm32 I don't know what the situation is. I know that there's a Linux system call, _ARM_cacheflush, that presumably does whatever's necessary. I don't think there's any 64-bit equivalent, so it might be that 32- and 64-bit ARM have different mechanisms for flushing.

The ARMv8 reference manual has the similar section as the document mentioned in comment 0 (but I couldn't find a link to it on the internet), which is expressed in both Aarch32 and Aarch64 assembly. And while this documentation mention that we have to flush the code this way, we still have to use the isb instruction on other cores which are running the code.

Looking at the reference from the comment [1], and more precisely at at the section B2.2.5 (and B2.2.7) this seems dependent on the "Multiprocessing extension" which it-self depends whether the memory is Inner Shareable or Outer Shareable, which is either true when the "Large Physical Address Extension" is used or is otherwise implementation defined. (from what I understand)

[1] https://silver.arm.com/download/ARM_and_AMBA_Architecture/AR570-DA-70000-r0p0-00rel2/DDI0406C_C_arm_architecture_reference_manual.pdf

Lars T Hansen [:lth]

Assignee

Comment 6

•

5 years ago

All ARM systems we care about are multi-processor capable, and all memory we care about is inner-shareable since the informal meaning of inner-shareable is "the conventional shared memory in a multi-processor system".

We should of course be aware of ARMv7 systems but all systems we encounter in the future are going to be ARMv8 (whether in 32-bit or 64-bit mode).

Jan says that as far as he knows, we are completely W^X at the moment. This means that when we generate new code, there's going to be a system call on at least one thread that mprotects some memory from writable to executable.

I don't know that the W^X factoid provides the necessary guarantees - we would require that if one thread of one process on one core mprotects some memory then all cores on the system with a thread from that process have to execute an ISB. (Cores without a thread from the process would not, because they would get the effect of the ISB once a thread migrates onto the core, I would think.) It's possible the mprotect will cause a context switch on all cores that have a thread from the process, and thus do what we need it to do, maybe in order to maintain the TLB (after all the memory mappings have changed), but I don't know this for a fact.

Linux source:
https://github.com/torvalds/linux/blob/5694cecdb092656a822287a6691aa7ce668c8160/arch/arm64/include/asm/tlbflush.h#L99

mprotect calls flush_tlb_range which (according to this documentation) may execute an ISB on each CPU, but possibly only if kernel mappings are affected. The TLB maintenance operations will presumably affect all the cores but in-flight instructions on those cores might not be affected and that's the problem we're trying to solve.

We have two choke points where we can insert ISB: (a) where a thread receives fresh tier-1 code (postMessage), we can insert an ISB there; (b) in the jump table for tier-1 to tier-2 transition, if we're careful about using code there that's created just once. (Tier-2 patching patches in a stub; stub executes ISB and then patches in the real code and jumps to it; stub code ranges are never reused.)

Lars T Hansen [:lth]

Assignee

Comment 7

•

5 years ago

Luke and I have had a conversation with Jacob Bramley from ARM and the conclusion of this discussion is that, at the moment, we are (almost) fine. I'll try to summarize the discussion briefly.

The setup is that we have a jump table per module, with one slot per function. On function entry, baseline-compiled code loads the slot corresponding to the function and jumps to that address. The initial value of the slot for a function is the address of the instruction following the jump.

When we tier the module, we background-compile the code and then make the machine code available:

get a writable page (which may have previously been executable code that was freed)
write the new machine code onto the page
mprotect the page to executable
racily store pointers to the new code in the jump table

Thus baseline functions will jump to their Ion versions on entry (but baseline functions stuck in loops, say, will never reach their Ion versions).

Step 3 happens in some background compiler thread and can be expensive and slow, but once Step 4 does the store, the code immediately becomes available to N other executing threads.

Jacob says he thinks this is safe (as in, no core will be seeing old, stale instructions), based on some assumptions:

(a) Step 1 must not re-use the page(s) that held the previous version of the code.
(b) Executing cores observe the pointer updates (4) after the new code is written (2).
(c) Pointer loads and stores (4) are (single-copy) atomic.
(d) At the beginning of step 2, no core is executing anything from the writeable page(s).

All of these hold in Firefox currently, except that we probably want to execute a DSB before step 3 so that we are sure that mprotect does not risk making a page executable before all the writes have reached memory and have been observed by all cores; there's a theoretical risk of this if mprotect only operates on the local core.

What makes it safe is that while cores can read ahead and thus technically see old instructions after new instructions become available, the branch through the branch table will have been mispredicted and hence those old instructions must be discarded.

In summary: no security issue in current Firefox but we'll probably want to get that DSB in there just for the sake of being safe; but if we do more fine-grained things with patching in the future we may need to worry about getting the ISB in there.

Lars T Hansen [:lth]

Assignee

Comment 8

•

5 years ago

Of course, the hard part about executing a fence is that every core that writes into the memory that is to become executable must execute the fence, or there must be a seq-cst synchronization edge between every writing core and the core that performs the mprotect, and the latter core can execute the fence.

In our case, we usually see triples or pairs of operations:

   masm.executableCopy(...)
   ...
   ExecutableAllocator::cacheFlush(...)
   ExecutableAllocator::makeExecutable(...)

and I think we should probably assume that our synchronization logic is already good enough to get all the bytes from whatever threads produce them to the thread that calls makeExecutable(), so the easy thing is probably to put the fence in makeExecutable() or even in something it calls.

Lars T Hansen [:lth]

Assignee

Comment 9

•

5 years ago

Attached file Bug 1529933 - Execute a fence before mprotect. r?luke — Details

Lars T Hansen [:lth]

Assignee

Comment 10

•

5 years ago

I think we should open this, it's not a concern. Who possesses the necessary

Assignee: nobody → lhansen

Status: NEW → ASSIGNED

Priority: P1 → P3

Lars T Hansen [:lth]

Assignee

Comment 11

•

5 years ago

Luke, any reason we shouldn't just ask to have this opened up?

Flags: needinfo?(luke)

Luke Wagner [:luke]

Comment 12

•

5 years ago

Yes, I think we're quite safe here.

Flags: needinfo?(luke)

Lars T Hansen [:lth]

Assignee

Comment 13

•

5 years ago

Dan, can we open this up please?

Flags: needinfo?(dveditz)

Jan de Mooij [:jandem]

Updated

•

5 years ago

Group: javascript-core-security

Flags: needinfo?(dveditz)

Jan de Mooij [:jandem]

Updated

•

5 years ago

Keywords: csectype-uaf, sec-high

Pulsebot

Comment 14

•

5 years ago

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/759a68d0af0f
Execute a fence before mprotect. r=luke

Noemi Erli[:noemi_erli]

Comment 15

•

5 years ago

Backed out changeset 759a68d0af0f (Bug 1529933) for bustages in backup-point-bug1315634.js

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=759a68d0af0fbdf24b36101503d8c61a879cfc25&selectedJob=234081979

Failure log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=234081979&repo=mozilla-inbound&lineNumber=2376
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=234082016&repo=mozilla-inbound&lineNumber=2731
https://treeherder.mozilla.org/logviewer.html#?job_id=234081989&repo=mozilla-inbound

Backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/6ea65bcee7cfbbce9dd6824f590f73313a9850b5

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 16

•

5 years ago

Hah, that was completely unexpected, probably indicates a bug elsewhere...

Flags: needinfo?(lhansen)

Lars T Hansen [:lth]

Assignee

Comment 17

•

5 years ago

Ah, because we jit the atomics we will execute a fence before reprotecting the jitted code, but the jitted code for the fence has of course not been installed yet.

Pulsebot

Comment 18

•

5 years ago

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7b9be2d40a83
Execute a fence before mprotect. r=luke

Oana Pop-Rus

Comment 19

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/7b9be2d40a83

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox67: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla67

Nicolas B. Pierron [:nbp]

Reporter

Updated

•

4 years ago

Blocks: 1661231