Closed Bug 1434400 (GLitch) Opened 6 years ago Closed 6 years ago

using WebGL to Rowhammer the GPU ("GLitch")

Categories

(Core :: Graphics: CanvasWebGL, defect, P1)

Tracking

RESOLVED FIXED
Tracking Status
firefox-esr52 --- wontfix
firefox-esr60 --- fixed
firefox58 --- wontfix
firefox59 - wontfix
firefox60 + wontfix

People

(Reporter: dveditz, Assigned: jgilbert)

Details

(Keywords: sec-high, Whiteboard: [CVE-2018-10229][disclose 1442504 in advisory for release when this is public] gfx-noted )

Attachments

(1 file)

From National Cyber Security Centre (Netherlands)
- - - - - - - - - - - - - - - - - - - - - - - - - 
Attached is a paper by several Dutch researchers of the Vrije Universiteit in Amsterdam that they wish to disclose to you. We are coordinating the disclosure on their behalf. In the paper they describe a Rowhammer attack using the GPU and WebGL. The paper already describes some possible mitigations, in addition in light of the developments on Spectre and Meltdown the researchers have provided some additional insights, copied below.

The paper has been accepted at the IEEE Security&Privacy conference later this year, and is currently awaiting publication. The S&P conference have changed their publication model this year and intends to publish the paper already in February. Should you require more time, please let us know so that we can ask S&P to extend the embargo on the paper to a later date.

Main questions:
- Could you let us know whether the researchers should ask for an embargo on the paper, or can this be published in February?
- Do you expect to release patches/mitigations for this vulnerability?
- What would be the timeline for these patches/mitigations?
- What would be the preferred publication strategy for this issue?

Once we have received initial feedback from the different parties (Apple, Google (including Android ARM-manufacturers) and Mozilla), we will send an update with a proposed timeline.

Please let us know if you have any further questions.

Kind regards,
NCSC-NL




Additional information of the researchers regarding the current status:

SPECTRE for GLitch:      

None of the currently proposed countermeasures affects our attack. The current countermeasures (i.e., timer throttling and array index masking in Chrome) do not interfere with WebGL, so they do not impede our exploit, since the whole attack runs on the GPU.

GLitch for SPECTRE:     
Spectre (from JS) relies on the possibility of performing fine-grained timing. We introduce some timing techniques that allow you to measure cached vs. uncached accesses on the CPU.

1. GL_TIMESTAMP_EXT, combined with clock edging [1] or a more precise variant of that technique, lets you distinguish between the two kinds of access. The variant runs a padding function with a known execution time after the access and checks whether it completes within the boundaries of two clock edges.

2. We discovered a mistake in our evaluation of the GLSync timers and recovered a new resolution of ~250 ns by using ticks-to-signal. We believe it might be possible to measure cached vs. uncached accesses when using the padding technique.
WIP on this side.
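
As an illustration of the padding technique described in item 1, here is a minimal Python simulation. It is not the researchers' code: the `SimClock` class, the function names, and the nanosecond figures are all assumptions made for this sketch, and simulated time stands in for a real GPU or CPU timer.

```python
class SimClock:
    """Simulated time source (hypothetical; stands in for a coarse timer)."""

    def __init__(self, resolution_ns, start_ns=0):
        self.now_ns = start_ns
        self.resolution_ns = resolution_ns

    def advance(self, ns):
        self.now_ns += ns

    def coarse(self):
        # What the observer sees: true time rounded down to the resolution.
        return (self.now_ns // self.resolution_ns) * self.resolution_ns


def wait_for_edge(clock, step_ns=1):
    """Spin until the coarse clock ticks, so measurement starts at an edge."""
    start = clock.coarse()
    while clock.coarse() == start:
        clock.advance(step_ns)


def padded_measure(clock, op_ns, pad_ns):
    """Align to an edge, run the operation, then count padding steps
    of known cost until the next edge appears."""
    wait_for_edge(clock)
    edge = clock.coarse()
    clock.advance(op_ns)          # the operation under test
    pads = 0
    while clock.coarse() == edge:
        clock.advance(pad_ns)     # padding function with known cost
        pads += 1
    return pads


fast = padded_measure(SimClock(1000), op_ns=100, pad_ns=10)
slow = padded_measure(SimClock(1000), op_ns=300, pad_ns=10)
```

In this simulation the faster operation leaves room for more padding steps before the next edge, so the padding count separates the two cases even though both fit inside a single tick of the coarse clock.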

MITIGATIONS: 

To tackle SPECTRE:

- We suggest that Mozilla completely disable GL_TIMESTAMP_EXT by returning 0, as Chrome currently does, and eventually remove GL_TIMESTAMP_EXT from the WebGL EXT_disjoint_timer_query extension entirely.

- We suggest removing the getSyncParameter function from the WebGL2 specification, since it represents a run-to-completion violation and gives the attacker insight into the current status of GPU operations.

- We suggest modifying the clientWaitSync function to accept a callback that runs after the fence is signalled. This would make it impossible to use the function as a timer, since it would no longer return any information about the status of the GPU at a given time.

To tackle GLitch:

To partially mitigate GLitch, browser vendors should, in addition to the measures above, completely disable the EXT_disjoint_timer_query extension so that the GPU execution time cannot be measured. For the contiguous memory side channel we do not rely on the resolution of the timer, since we can loop more times over the same access pattern, depending on the granularity, and then compute the mean access time. As a consequence, simply reducing the granularity of such timers would not be the optimal solution. If disabling the extension is not an option, we would still suggest reducing the granularity in order to make the attack slower and less practical.
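
The averaging argument above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `clamp` and `mean_access_time` are hypothetical helpers, the access costs are invented, and the timer is modelled as simple rounding down with no jitter.

```python
def clamp(t_ns, granularity_ns):
    """Round a timestamp down to a multiple of the timer granularity."""
    return (t_ns // granularity_ns) * granularity_ns


def mean_access_time(access_ns, repeats, granularity_ns):
    """Estimate a per-access cost using only clamped start/end timestamps."""
    start = clamp(0, granularity_ns)
    end = clamp(access_ns * repeats, granularity_ns)
    return (end - start) / repeats


GRANULARITY_NS = 1_000_000  # assumed 1 ms timer, for illustration

# A single access is invisible to the coarse timer...
single = mean_access_time(123, 1, GRANULARITY_NS)
# ...but a million repetitions recover the per-access cost.
fast = mean_access_time(123, 1_000_000, GRANULARITY_NS)
slow = mean_access_time(187, 1_000_000, GRANULARITY_NS)
```

In this model the estimation error is bounded by the granularity divided by the number of repetitions, which is why a coarser timer slows the measurement down rather than preventing it.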

FINAL NOTES:
While our main suggestion is to disable timers in order to (partially) block our side channel and other possible ones, we believe that this is only a quick fix against the side-channel threat, effective only in the short run. It should not be considered a long-term solution against this family of attacks since, as shown by this and previous work [1][2], it is possible to craft timers from many different JS interfaces.
 
 
[1] Kohlbrenner, D., & Shacham, H. (2016). Trusted Browsers for Uncertain Times. USENIX Security '16
[2] Gras, B., Razavi, K., Bosman, E., Bos, H., & Giuffrida, C. (2017). ASLR on the Line: Practical Cache Attacks on the MMU. NDSS '17

With kind regards,
...........................................................................
National Cyber Security Centre
P.O. box 117 | 2501 CC | The Hague | The Netherlands | www.ncsc.nl
Note: paper was encrypted with the wrong key. Working on getting a readable copy.
SPECTRE recommendations 1-3 seem over-aggressive. We should be fine applying the same sort of mitigations we've applied to performance.now().

Changing clientWaitSync to use a callback is not an effective change, so we should ignore that aspect of the recommendation.

I'm curious that they finger 'getSyncParameter' but not other synchronous calls, like readPixels, getBufferSubData, and finish(). Maybe I don't understand what they mean by 'run-to-completion violation'. Maybe whole-pipeline-flushing calls are ok, since they reduce the resolution enough. If so, we should be fine simply throttling fence usage.

I'll need to see the paper to know more, but I suspect the mitigation for GLitch is similar to SPECTRE mitigation 1 and performance.now().


Regardless of what we do to address untrusted input, devs will need a way to act as if trusted in order to use these tools for debugging and profiling. We should preserve their presence in the browser, but not allow them to be run by untrusted content. (a pref that doesn't persist across restarts? ENV variable?)
More specifically, the fix for these should be pretty straightforward, though since it's Jan 30, I'm worried about the paper going out in Feb already. We can definitely have this sorted by mid-Feb.


Also, run-to-completion[1] just means that it blocks the main thread/queue, which is known. (though curiously, this is the behavior of clientWaitSync, whereas getSyncParameter is actually non-blocking, though you could loop on it)

[1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/EventLoop#Run-to-completion
Ah, run-to-completion /violations/ are when run-to-completion isn't done, which is what we get for naive polling like getSyncParameter.
For background, "Trusted Browsers for Uncertain Times" is relevant:
https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_kohlbrenner.pdf
(In reply to Jeff Gilbert [:jgilbert] from comment #2)
> SPECTRE recommendations 1-3 seem over-aggressive. We should be fine applying
> the same sort of mitigations we've applied to performance.now().

Can you be more specific? The mitigations we've done with performance.now() (and all timers in fact) are really poor delaying measures. We have proofs of concept that entirely bypass any amount of timer rounding or jittering we do. 

(Also we don't do jittering yet, because it's really hard to do without letting time go backwards. I recently 'broke' Chrome's jitter by going backwards.)
Ok, but it doesn't make sense to me to fix webgl and performance.now at different rates, particularly since this would tip off that the performance.now mitigations are insufficient. I do believe we need to bring WebGL up to the same mitigation level as the rest of the browser, but going further earlier seems pointless. If we are investigating further mitigations to performance.now, this work should be folded into that, so we can move forward together.

I don't have access to any of the performance.now bugs, so I can't comment on how we can integrate their findings. I do think we need to have a unified (or at least cohesive) approach here.
Assignee: nobody → jgilbert
Priority: -- → P1
Whiteboard: gfx-noted
Alias: GLitch
Summary: using WebGL to Rowhammer the GPU (plus bonus SPECTRE) → using WebGL to Rowhammer the GPU ("GLitch")
Is the attack here just using WebGL features as a timer or is it actually doing the speculative access (bypassing a bounds check et al) in WebGL?  If it's the former, then I'd agree with Jeff that just using timer clamping analogous to p.now() is  sufficient.  But if it's the latter, then I think Tom's right and we need to stop the speculative access (just like we're doing in the JIT) in *addition* to clamping/fuzzing the WebGL timer.
It's not doing spec-access, just using WebGL as a timer source with which to measure other timing attack vectors.
At this point we might want to actually pivot this bug into webgl+SPECTRE discussion, and move GLitch to its own bug.
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> It's not doing spec-access, just using WebGL as a timer source with which to
> measure other timing attack vectors.

Oh okay. Is it easy to wrap the timer call that WebGL returns with ReduceTimePrecisionAs()?  That would feed it through all of our timer clamping stuff.
(In reply to Tom Ritter [:tjr] from comment #11)
> (In reply to Jeff Gilbert [:jgilbert] from comment #9)
> > It's not doing spec-access, just using WebGL as a timer source with which to
> > measure other timing attack vectors.
> 
> Oh okay. Is it easy to wrap the timer call that WebGL returns with
> ReduceTimePrecisionAs()?  That would feed it through all of our timer
> clamping stuff.

Yep. I can get that thrown together as a starting point.
Keywords: sec-high
My initial take is that we might be OK if we neuter the timers, since this may render the physical adjacency detection ineffective.
One other mitigation that comes to mind is to throttle the number of sync objects we allow per frame or per second. We might have to do both, because I think you can just edge-detect otherwise (though, to my knowledge, you can already edge-detect today, since the basic clamping we do doesn't mitigate that).
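
A per-frame cap on sync objects could look roughly like the following Python sketch. It is purely illustrative: `SyncBudget`, its method name, and the limit of 2 are assumptions, not anything taken from the Firefox or Chrome code.

```python
class SyncBudget:
    """Toy per-frame cap on fence creation (hypothetical helper)."""

    def __init__(self, max_per_frame):
        self.max_per_frame = max_per_frame
        self.frame = None
        self.used = 0

    def try_create(self, frame):
        """Return True if a new sync object may be created in this frame."""
        if frame != self.frame:
            # New frame: reset the budget.
            self.frame, self.used = frame, 0
        if self.used >= self.max_per_frame:
            return False
        self.used += 1
        return True


budget = SyncBudget(max_per_frame=2)
same_frame = [budget.try_create(0) for _ in range(3)]
next_frame = budget.try_create(1)
```

A cap like this limits how many fence-based samples content can take per frame, but on its own it would not stop edge detection with the samples that remain.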
One of the things we need to keep in mind is that many of the mitigation approaches center on having a central standard time. The GPU timestamp queries a different time counter, so we need to hitch GPU time to our coordinated system time.

Implementation aside for now, if we had a function to map real timestamps to mitigated timestamps, we may be able to apply it directly to real gpu timestamps. 

We should consider establishing an (approximate) offset to link gpu time to system time, so that we functionally expose only one clock.

Dealing with time intervals is trickier. Clamping or fuzzing time intervals separately from mitigated system time seems unsafe to me. I believe we can transform these intervals (TIME_ELAPSED) into timestamp queries and reconstruct the interval from the mitigated timestamps, which is safe. (this is the same as JS diffing performance.now queries)
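
The idea of exposing a single clock and rebuilding intervals from mitigated timestamps can be sketched as below. This is a minimal Python model with assumed names and values: `RESOLUTION_US`, `mitigate`, `gpu_to_system`, and the fixed offset are all hypothetical, and real mitigation code would do more than round down.

```python
RESOLUTION_US = 100  # assumed clamping resolution, for illustration only


def mitigate(system_us):
    """Map a real system timestamp to a clamped one (round down)."""
    return (system_us // RESOLUTION_US) * RESOLUTION_US


def gpu_to_system(gpu_us, offset_us):
    """Convert a GPU counter reading to system time via a measured offset."""
    return gpu_us + offset_us


def mitigated_gpu_timestamp(gpu_us, offset_us):
    """Expose GPU time only after it has gone through the system-time mitigation."""
    return mitigate(gpu_to_system(gpu_us, offset_us))


def mitigated_elapsed(gpu_begin_us, gpu_end_us, offset_us):
    """Rebuild a TIME_ELAPSED-style interval from two mitigated timestamps,
    rather than clamping the raw interval on its own."""
    return (mitigated_gpu_timestamp(gpu_end_us, offset_us)
            - mitigated_gpu_timestamp(gpu_begin_us, offset_us))


OFFSET_US = 5_000  # assumed GPU-to-system offset
short = mitigated_elapsed(1_234, 1_290, OFFSET_US)
longer = mitigated_elapsed(1_234, 1_456, OFFSET_US)
```

Because both endpoints go through the same mapping, the interval reveals no more than two mitigated system-clock reads would, which is the equivalence with diffing performance.now values mentioned above.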
After discussions with Chrome, we're going to over-restrict a bit:

- Disable disjoint_timer_query for now
- Prevent Sync objects from becoming SIGNALED until the next frame

That's what we're starting with.
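
The second restriction can be modelled with a toy fence whose visible status only changes at frame boundaries. This Python sketch is an assumption-laden illustration: `FrameGatedSync` and the frame counter are hypothetical and do not reflect the actual patch.

```python
UNSIGNALED, SIGNALED = "UNSIGNALED", "SIGNALED"


class FrameGatedSync:
    """Toy fence whose reported status is gated on frame boundaries."""

    def __init__(self, created_frame):
        self.created_frame = created_frame
        self.gpu_done = False  # set when the (simulated) GPU work finishes

    def status(self, current_frame):
        # Within the creating frame, always report UNSIGNALED,
        # even if the GPU has already finished.
        if current_frame <= self.created_frame:
            return UNSIGNALED
        return SIGNALED if self.gpu_done else UNSIGNALED


sync = FrameGatedSync(created_frame=10)
sync.gpu_done = True
same_frame_status = sync.status(10)
next_frame_status = sync.status(11)
```

Polling such a fence in a tight loop yields no sub-frame timing information, since the answer cannot change until the next frame.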

We do have some lingering concerns about how essential the physical adjacency detection phase is in order to carry out the rowhammer attack, given the fairly complete understanding of the allocators involved.
publication of the paper pushed out to May 5
Whiteboard: gfx-noted → gfx-noted [Paper to be published May 5]
Awesome. That should be fine for us.

Worse news is that I suspect the physical adjacency detection is unnecessary, though I haven't proved it.

We did clarify our approach for mitigating timers though, so smooth sailing on that. (Make Sync objects act like Queries, and fix up clientWaitSync to handle the too-soon special case)
Setting 58 wontfix based on the publication being pushed out.
Jeff, what's the target for landing a mitigation here?  We're a week from 59 rc.
Flags: needinfo?(jgilbert)
It'll be in. The patch is written.
Flags: needinfo?(jgilbert)
Flags: needinfo?(milan)
Disabling disjoint timer queries: (sec bug, land asap)
https://bugzilla.mozilla.org/show_bug.cgi?id=1442504

Require round-trip for Sync objects: (public response to spec change, can probably ride the trains)
https://bugzilla.mozilla.org/show_bug.cgi?id=1442502
Did this actually land? We are building the release candidate today. Firefox 60 release will be May 9.
We could plan to include this either in an RC2 build for 59, or in a dot release.
Flags: needinfo?(jgilbert)
Looks like bug 1442504 did land for 59 and esr.
Should we call this fixed for 59 then?
Depends on: 1442502
I'm removing the tracking flag because I don't think there's anything left for relman to do here for 59. If there is, please feel free to ask for tracking again.
We're due to ship Fx60 on May 9, which is 4 days after this is due to be disclosed IIUC. That seems unfortunate :(
Depends on: 1442504
Whiteboard: gfx-noted [Paper to be published May 5] → gfx-noted [Paper to be published May 5][disclose 1442504 in advisory for release when this is public]
Alias: GLitch
Flags: needinfo?(jgilbert)
[From NCSC:]

Hello all,

This week we have seen the accidental disclosure of the GLitch vulnerability as described in the "Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU" paper of Frigo et al. Previously we had all agreed on an embargo until the beginning of May. This was also communicated to the IEEE S&P conference. However, IEEE accidentally published the paper at the initially proposed publication date.

The paper was quickly removed from the IEEE page, but it had been downloaded by several readers, as can be seen from Twitter discussions. Unfortunately, one of the visitors of the IEEE page was the crawler of the Wayback Machine of archive.org, which currently still hosts a copy of the PDF. So the paper is still available to people who know the original URL, which unfortunately was also shared on Twitter.

While we have attempted to contain the disclosure, we are now forced to conclude that this containment is not perfect, and the paper should be considered disclosed to the curious public.

We propose to do a public disclosure on Wednesday 21 March at 16:00 CET.

Regards,
NCSC.
Whiteboard: gfx-noted [Paper to be published May 5][disclose 1442504 in advisory for release when this is public] → [Paper leaked, announcement March 21][disclose 1442504 in advisory for release when this is public] gfx-noted
NCSC-NL says GLitch has been assigned CVE-2018-10229
Whiteboard: [Paper leaked, announcement March 21][disclose 1442504 in advisory for release when this is public] gfx-noted → [CVE-2018-10229][disclose 1442504 in advisory for release when this is public] gfx-noted
Flags: needinfo?(milaninbugzilla)
Dan, Jeff, can this bug now be closed fixed?
Flags: needinfo?(jgilbert)
Flags: needinfo?(dveditz)
I think so. dveditz, anything else here?
Flags: needinfo?(jgilbert)
Adding Jessie (new graphics engineering manager) to all sec-crit and sec-high graphics bugs
We seem to be safe here with our removal of the perf queries the exploit used.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Group: gfx-core-security → core-security-release
Flags: needinfo?(dveditz)

Removing employee no longer with company from CC list of private bugs.

Group: core-security-release