DOM bindings can generate code that holds an invalidatable typed array / array buffer live across a GC
Categories
(Core :: DOM: Bindings (WebIDL), defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | wontfix |
firefox-esr102 | --- | wontfix |
firefox-esr115 | - | wontfix |
firefox94 | --- | wontfix |
firefox95 | --- | wontfix |
firefox96 | --- | wontfix |
firefox97 | --- | wontfix |
firefox98 | --- | wontfix |
firefox109 | --- | wontfix |
firefox110 | --- | wontfix |
firefox111 | --- | wontfix |
firefox112 | --- | wontfix |
firefox113 | --- | wontfix |
firefox114 | --- | wontfix |
firefox115 | --- | wontfix |
firefox116 | --- | wontfix |
firefox117 | --- | wontfix |
firefox118 | --- | wontfix |
firefox119 | + | fixed |
People
(Reporter: sfink, Assigned: peterv)
References
(Regressed 1 open bug)
Details
(Keywords: csectype-uaf, sec-high, Whiteboard: [adv-main119+r])
Attachments
(14 files, 31 obsolete files)
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
198.32 KB,
patch
|
Details | Diff | Splinter Review | |
56.53 KB,
application/octet-stream
|
Details | |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
tjr
:
sec-approval+
|
Details | Review |
Landing bug 1680607, which was primarily for static GC hazard checks but also introduces dynamic check, seems to have uncovered a class of hazards that the static analysis misses. From one of the backout messages:
Please also check:
assertion failure -> https://treeherder.mozilla.org/logviewer?job_id=327990826&repo=autoland&lineNumber=2305
crash - https://treeherder.mozilla.org/logviewer?job_id=327990898&repo=autoland&lineNumber=2085
These are two manifestations of the same problem: there is typed array or array buffer data on the stack, perhaps as an Optional<ArrayBufferViewOrArrayBuffer>
or a RootedSpiderMonkeyInterface<Uint8Array>
. It is either live across a GC call, or passed into something that GCs. The static analysis doesn't complain because it's rooted, or because you only pass in an address (of a GC pointer). But in either case, the AutoCheckCannotGC
is active so if we end up doing something that could GC, we'll get the assertion.
I'm guessing there are more occurrences. These are legitimate UAF bugs, though they are difficult to trigger and rare in practice because you require either a nursery object or a compacting GC to break something. Not many nursery objects should be flowing through these paths, and compacting GCs usually run when there's nothing else on the stack.
Example 1 - mozilla::dom::IOUtils::Write
is called with a pointer extracted from a RootedSpiderMonkeyInterface<Uint8Array>
. The Write
can definitely GC (it's creating a Promise
), but It is possible that if you follow the data flow of the called function, the value will be dead by the time the GC happens; I haven't checked.
Example 2 - an Optional<ArrayBufferViewOrArrayBuffer>
is used after a call that could GC: TextDecoderOptions::Init
. The analysis doesn't see it because it's passing a pointer to a pointer. (The analysis does add this to a list of "unsafe references", which we could check.)
Reporter | ||
Comment 1•5 years ago
•
|
||
Looking at it more, this is looking like a pretty deep flaw in how bindings handle typed arrays & similar. When you pass one into a function generated by DOM bindings, the argument is given to the SpiderMonkeyInterface
stuff that extracts out the data pointers. It has a bunch of machinery, all of which looks correct, for making sure the owning JSObject*
is properly rooted. The problem here is that rooting isn't enough; because we've grabbed out a pointer, GC can invalidate even a rooted pointer. (Or rather, the rooted pointer itself will be updated properly, but any pointers into its data will not.)
Option 1: Since we do have stack holder objects here, I could apply the same fix that I used recently to only hang onto immovable data pointers (at a cost of using stack space for the largest possible movable data, and the performance cost of memcpy
ing the data). Right now, that's feeling pretty unsatisfying, since it's paying a non-negligible cost all the time for something that is rarely needed, and it would apply to all WebIDL interfaces that manipulate array buffers and their views.
Option 2: Another alternative would be to eliminate TypedArray_base
and friends entirely, and instead fetch the data pointers on demand. I fear that this would just make it harder to detect the underlying problem -- I'm only seeing this problem because I changed those classes to complain. If I eliminated that layer, then the problem might just recur in lots of handwritten, harder to analyze code.
Option 3: Yet another option: change the DOM bindings argument holders so that they still do all of the argument checking they do now, but they don't retrieve the problematic data pointers. Instead, when one of these argument values is needed, you'd "activate" the interface object to read out the data pointers, and then afterward Reset()
it. (It might actually reset itself whenever it is passed as a parameter, but the analysis might require the explicit Reset
to properly track what's going on.) In effect, this is similar to the previous alternative -- the interface objects would only be used for rooting, and for informing the analysis of in what ranges of code it is legal to be holding data pointers.
Option 4: The final option I can come up with is something that I tried when I first ran into TypedArray_base
problems: have an API call that converts typed arrays to using immovable storage (if they aren't already). It has some of the memcpy
drawbacks, though only once per object. The bigger problem I ran into is that TypedArray_base
-using code isn't always able to handle or propagate allocation failure properly. Sometimes it's not even on the correct thread.
Comment 2•5 years ago
|
||
To enable the analysis to see problems like this, is it possible to make the methods that return data pointers return pointers to new wrapper types that the we can tag as invalidated by GC? Then we could do option 2 (fetch data pointers on demand) and the analysis would support us.
Updated•5 years ago
|
Reporter | ||
Comment 3•5 years ago
|
||
(In reply to Jon Coppeard (:jonco) from comment #2)
To enable the analysis to see problems like this, is it possible to make the methods that return data pointers return pointers to new wrapper types that the we can tag as invalidated by GC? Then we could do option 2 (fetch data pointers on demand) and the analysis would support us.
We've tried doing something like that, with string pointers I think. They get passed to library char*
functions or stored temporarily in char*
so much that it wasn't really practical. It might be more feasible here? Though the data gets used for a lot of different things.
But with bug 1680607 landed, TypedArray_base
is pretty much exactly what you describe. It's a container for a data pointer that is explicitly retrieved by the caller (via ComputeState
), and invalidated on a GC. That enabled enough analysis support to catch two problems. The additional problems (that this bug is for) were not caught statically because (1) rooting is seen as sufficient protection when it is not, in this case; and (2) the problematic pointer is passed in as a pointer to a pointer and so we only see it as an unsafe reference. We could go through and fix our unsafe references and drive them to zero unannotated ones. It's not going to be easy, though.
That would fix the problem detection side. Though with bug 1680607, we're not doing too badly with the problem detection, it's just that it relies on some dynamic checks instead of reporting everything statically.
The harder side is fixing these for real. I guess I'd like to answer that part before figuring out what the best detection mechanism is.
One of those two statically-caught problems was extracting a pointer and using it offthread. (In fact, it was doing a limited event loop spin while the pointer was live.) That's only going to work if the pointer really cannot be invalidated (which was the fix I used.) I was willing to take the stack space and memcpy
hit at the point of pointer extraction for that one, but I'm less enthusiastic about doing that for all WebIDL interfaces that use any of these types.
I'll add an Option 5: Add an interface for converting a typed array to point to stable memory, by un-inlining objects with inline data. Additionally, have a marker to prevent compaction of that object (or arena, or AllocKind
, or whatever) while it is in scope. I attempted the un-inlining approach earlier, but I was trying to do it at a place that couldn't really handle allocation failure. In the DOM bindings code, at least, I think getting a cx and handling errors would be fine.
I keep trying to come up with an approach where we preallocate stack space, but only do the memcpy
if a GC happens. But that requires tracking and updating the pointers. TypedArray_base
or a variant StableTypedArray_base
could be used for that -- but in current practice, they're not. The pointers are extracted early and passed into and stored in various things. (Not as many as I had been expecting, given the low number of static failures with bug 1860607, but the dynamic failures may point to a bunch more.)
Updated•5 years ago
|
Assignee | ||
Comment 4•5 years ago
|
||
I am starting to think that not storing the pointers in TypedArray_base in the first place would be better. From looking at a number of places where we call ComputeState, we actually then immediately copy the data ([1], [2], and even [3]). Maybe we should have a better API for typed arrays for when we just want to copy the data anyway. The only worry is that people just blindly use that potentially more expensive API when they don't need it (I think we usually copy-paste code like that from existing callsites :-)).
We could then also add a separate stack-only class (returned from ComputeState?) just for the data pointers. It could narrow the scope where these data pointers are assumed alive and correct?
[1] https://searchfox.org/mozilla-central/source/toolkit/components/extensions/webrequest/StreamFilter.cpp#131
[2] https://searchfox.org/mozilla-central/source/layout/style/FontFace.cpp#57
[3] https://searchfox.org/mozilla-central/source/dom/websocket/WebSocket.cpp#2290
Assignee | ||
Comment 5•5 years ago
|
||
(In reply to Steve Fink [:sfink] [:s:] from comment #0)
Example 1 -
mozilla::dom::IOUtils::Write
is called with a pointer extracted from aRootedSpiderMonkeyInterface<Uint8Array>
. TheWrite
can definitely GC (it's creating aPromise
), but It is possible that if you follow the data flow of the called function, the value will be dead by the time the GC happens; I haven't checked.
This is an example of immediate copying after ComputeState, so might be able to use that proposed API for copying.
Reporter | ||
Comment 6•5 years ago
•
|
||
I started looking into a copying API. It might still require some sort of "pinning" interface, but this time because of new features: it looks like tc39 will probably add ResizeableArrayBuffers that can change views' lengths. That means, for something like your [1], that we need to know the length is not going to change between retrieving the length and copying the data. In the case of [1]:
aArray.ComputeState();
if (!aData.SetLength(aArray.Length(), fallible)) {
aRv.Throw(NS_ERROR_OUT_OF_MEMORY);
return false;
}
memcpy(aData.Elements(), aArray.Data(), aArray.Length());
The length is retrieved in ComputeState
, then used to set up the destination, and then the data is copied. I guess we could track and and let it fail? So:
size_t length = aArray.Length(); // Internally calls ComputeState, or just part of it
if (!aData.SetLength(length, fallible)) {
aRv.Throw(NS_ERROR_OUT_OF_MEMORY);
return false;
}
if (!aArray.CopyTo(aData.Elements(), aArray.Data(), length)) {
aRv.Throw(NS_ERROR_INVALID_ARG); // Or some more specific error
return false;
}
where the idea is that if length changes between the Length()
and CopyTo()
calls, it would return failure. Or perhaps only if the length increases, though honestly any change seems like a bug waiting to happen.
A stricter alternative would be to disallow typed arrays that are auto-tracking the length of an underlying ResizeableArrayBuffer, but that seems a bit weird.
Or if we were to lean on the hazard analysis more, we could teach it that accessing Length()
changes TypedArray_base
from something that needs to be rooted across a GC, into something that in unrootable and can never be live across a GC, and then "reactivate" it after CopyTo
(so back to just needing to be rooted). CopyTo
would need to discard any computed state as well.
This is piling multiple policies into a single type, so perhaps it would still be better to separate out TypedArray_base
from an unrootable TypedArrayData
type, as suggested by jonco in comment 2 and peterv in comment 4.
Reporter | ||
Comment 7•5 years ago
|
||
Note that for a long time, I was resisting relying on data flow in the hazard analysis, but more recently I've been adding more and more of it. So things like changing the nature of TypedArray_base
is not too far-fetched anymore.
Reporter | ||
Comment 8•5 years ago
|
||
Oh, duh, the Length()
thing I described in comment 6 just means that nsTArray::SetLength
needs to be understood as not GCing. (It ends up calling things through function pointers, but it's certainly annotatable.)
I'm skimming through some of the uses.
- pass in to be copied by non-GCFunction: ChromUtils.cpp
- held across scary do_GetService, then copied: Crypto.cpp
- copied in immediately called function: DOMMatrix.cpp
- passed down into all kinds of crazy scary stuff; copying small data would be fine: DOMParser
- immediate copy: nsDOMDataChannel
- copy, but not very immediate: CanvasRenderingContext2D.cpp
- enough functions are called in between that it will require work to teach the analysis that none of them can GC
- held across scary
CreateWrappingDataSourceSurface
that can release aRefPtr
: WebGLTextureUpload.cpp- it's hard to prove that
~RefPtr
won't GC
- it's hard to prove that
- held across
~nsCOMPtr
, similar to~RefPtr
: BodyExtractor.cpp
It does look like the majority of these boil down to pretty quickly copying the data. Scanning through them also makes me realize that the hazard analysis really ought to be checking this somehow. That should happen in conjunction with changing the interface to make it more possible to use safely.
Reporter | ||
Comment 9•5 years ago
|
||
Back to looking at this. I tweaked the hazard analysis so that with a function foo(mozilla::dom::Uint8Array& aData)
that accepts a reference to a TypedArray_base
subclass, that variable's type is considered to hold unrooted GC pointers. That flags any situation where a GC can happen while the parameter is still live, regardless of whether ComputeState
has been called. (Actually, if it really is one of these types and not a RootedSpidermonkeyInterface
type, then these are real hazards.) Technically, this is overzealous -- the analysis can't distinguish between pointers and references, and it's possible that some of these are doing a Handle
-like thing by passing a pointer to a rooted typedarray. But probably not.
That flagged 73 hazards, 34 of them in TestJSImplGenBinding.cpp
, which is an autogenerated test program. A number of these are bogus because they require more annotations to ignore things like PLDHashTable
function pointers that aren't going to GC. The remainder that seems definite:
- WebCryptoTask.cpp:467 : the parameter
const CryptoOperationData& aData
crosses over the call toInit
which callsJS_GetProperty
, and then goes into the callSetData(aData
). There's definitely a classic rooting hazard here with the object pointer. I'm not sure when theComputeState
happens. - Actually, a bunch of the hazards are basically the same thing in different locations in the same file.
- Promise.h:322 : standard rooting hazard.
aArgument
can be invalidated during theAutoEntryScript
constructor, and is then passed intoToJSValue
. - dom/canvas/ClientWebGLContext.cpp:3103 :
GetBufferSubData
parameterdstData
crossesValidateNonNegative
which can GC before being used. Standard rooting hazard. Given the parameter name, the data is probably overwritten, but that would still be writing after freeing. - the same thing later in the file, in
ReadPixels
.
Not too bad of a list. Note that all of the "classic rooting hazards" might also be problems with interior pointers, so just fixing the rooting problem may not be sufficient.
Comment 11•5 years ago
|
||
Steven, could you please have sfink finish the work for this sec-high bug?
Comment 12•5 years ago
|
||
I will talk with sfink off line about this bug
Updated•5 years ago
|
Comment 13•4 years ago
|
||
Steve, could you please find time to finish the work for this sec-high bug?
Reporter | ||
Comment 14•4 years ago
|
||
Work is in progress, but it's going to be gradual and probably require some fairly extensive changes. The basic issue is that the DOM bindings typed array handling is doing too many different things. I am changing the JSAPI for typed arrays to move some of that work out of the DOM code, at which time we can make smaller more targeted classes that will cover the different scenarios currently being (mis)handled by the all-encompassing classes currently being used.
Comment 15•4 years ago
|
||
Set to S2 as this is a sec-high and could possibly hit UAF.
Comment 16•4 years ago
|
||
Peter said it requires new APIs to fix this properly. He's going to schedule a meeting with Steve to discuss.
Reporter | ||
Comment 17•4 years ago
|
||
I talked with peterv and started going through uses of ComputeState
to try removing them to see what new APIs we need. I'll do a recap of my findings so far.
The set of hazards we're trying to resolve:
- the standard GC rooting hazards: something holds onto a
dom::TypedArray_base
subclass, we do a GC, the JSObject is deleted. - the object is rooted (eg via
RootedSpiderMonkeyInterface
), but the data is stored inline in theJSObject
and gets moved during a GC - the data is moved while the object is not rooted
- whether the object is rooted or not, the length changes as a result of JS code running in between when the length is retrieved and used
That last one can only happen if you detach the underlying ArrayBuffer
right now, but we will be adding resizable ArrayBuffer
s which will introduce another way for it to happen. I haven't kept up with the latest spec proposals, but I think the length can change to any value: zero, smaller but nonzero, and larger.
Inline data will always be fairly small. Right now, it looks to be at most 96 bytes on 64-bit.
The set of facilities I'm thinking of having to handle this code are:
- A
FixedData
() getter that returns data guaranteed not to move during a GC, by requiring the caller to pass in a small buffer. If the data is inline, the data will be copied into the buffer and a pointer to the buffer returned. Otherwise, it will return the actual malloc pointer. - data and length accessors that require the caller to pass in a
nogc
parameter. This is not ironclad; the caller could still hang onto the data pointer or length past the end of the lifetime of that parameter. But at least you have to work to make things unsafe, and it enables the static analysis to double-check whether any GC is possible within thenogc
scope. - a
CopyData()
accessor that copies the data without exposing any pointers at all. Most of the time, that's all that's happening anyway, though the copy is sometimes buried a couple of layers down.
One limitation of the FixedData()
approach is that it does not require the object to be rooted. You could call that method on an unrooted structure, do a GC, and have the malloc pointer freed. I'm still working out if there's a way to require rooting for it.
I will post some patches with my current revision of the APIs needed to remove ComputeState
, together with a couple of patches where I use them in a small handful of places, and peterv will look into going through the rest of the code. When I finish up a few other things, I hope to join in the fun.
Once ComputeState
is gone, then we can mark dom::TypedArray_base
as a GC pointer, and the hazard analysis should be able to double-check our work.
Reporter | ||
Comment 18•4 years ago
|
||
Reporter | ||
Comment 19•4 years ago
|
||
Reporter | ||
Comment 20•4 years ago
|
||
Reporter | ||
Comment 21•4 years ago
|
||
Reporter | ||
Comment 22•4 years ago
|
||
Reporter | ||
Comment 23•4 years ago
|
||
Reporter | ||
Comment 24•4 years ago
|
||
Reporter | ||
Comment 25•4 years ago
|
||
Reporter | ||
Comment 26•4 years ago
|
||
Reporter | ||
Comment 27•4 years ago
|
||
Reporter | ||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Reporter | ||
Comment 28•4 years ago
|
||
Another wrinkle with the FixedData
approach is that because it does not require passing in a nogc
token, that it is then possible to run JS code while holding the pointer. And that JS code could detach the ArrayBuffer or view, making the pointer invalid (or equivalently, making the valid length be zero). One of the patches in this series adds a FreezeLength()
API that returns the length but also makes anything that might change it fail until UnfreezeLength()
is called. It's a pretty heavy-handed approach, and it causes the failure to happen when you try to detach rather than when you try to use the length, but it's safer. I'm still questioning whether it's a good idea.
Updated•4 years ago
|
Reporter | ||
Comment 29•4 years ago
|
||
Reporter | ||
Comment 30•4 years ago
|
||
Note that at least one of the fixes here will produce a false alarm hazard, because the analysis thinks a forgotten refptr could GC in its destructor. I'm not sure if I'm supposed to be dependencies on insecure bugs from secure bugs, but the fix is bug 1746090.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 31•4 years ago
|
||
Status update - Steve has created some new APIs that we're trying out and did some example fixes using them. The next step is for Peter to see how far that could get us.
Updated•4 years ago
|
Comment 32•4 years ago
|
||
Status update - Steve and Peter made good progress on this. New APIs Steve added are feasible. Peter is applying the new APIs to the surrounding code but still a lot to do.
Comment 33•3 years ago
|
||
Assigned to Peter as he has been working on this.
Comment 34•3 years ago
|
||
Status update - Peter already converted many callers into new APIs, still have 9-10 to be converted.
Assignee | ||
Comment 35•3 years ago
|
||
Assignee | ||
Comment 36•3 years ago
|
||
Depends on D152494
Assignee | ||
Comment 37•3 years ago
|
||
Depends on D152495
Assignee | ||
Comment 38•3 years ago
|
||
Depends on D152496
Assignee | ||
Comment 39•3 years ago
|
||
Depends on D152497
Comment 40•3 years ago
|
||
Updates - Peter is still working on this. In his WIP, he is able to convert all the use cases as well as new ones he discovered with the new APIs he implemented.
Comment 41•3 years ago
|
||
Updates - Peter and Steve had a meeting to review/discuss WIPs two weeks ago. Peter is currently addressing the comments.
Comment 42•3 years ago
|
||
Updates - still working on this. Pushed the latest patches to try, and has been investigating failures.
Comment 43•3 years ago
|
||
Peter - would you please give a brief update on this bug? Thank you.
Assignee | ||
Comment 44•3 years ago
|
||
I have been chasing a weird issue, which I finally found the cause of (accessing the data of a zero-length span tries to access an invalid pointer). That sees to have fixed the last failing test I was seeing.
We found one problematic DOM API that looks vulnerable, but luckily it's a chrome-only API so we don't think we need to worry about it (and the patches with the new API will fix it).
Updated•3 years ago
|
Assignee | ||
Comment 45•3 years ago
|
||
Depends on D131301
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 46•3 years ago
|
||
Assignee | ||
Comment 47•3 years ago
|
||
Depends on D152498
Comment 48•3 years ago
|
||
Updates - patches in preview. Peter and Steve were running more static analysis, some tweaks in patches on the way.
Comment 49•3 years ago
|
||
Hi Steve, my understanding was that you were looking at static analysis issues before the holiday break. Any updates since then? Thanks.
Comment 50•3 years ago
|
||
Updates per the chat with Steve - before the holiday, Steve looked into the static analysis warnings caused by Peter's patches. Steve changed the API slightly to make the region smaller where the analysis would look for problems (calls to functions that could GC) that reduced the number of warnings down to zero. However, the "zero warning" result is because of a bug in the analysis. So the next step here is that Steve improves the analysis so that it is able to check the smaller regions correctly.
Reporter | ||
Comment 51•3 years ago
|
||
Specifically, I am now passing in an AutoCheckCannotGC&&
parameter to the lambdas passed to eg ProcessTypedArrays
. Unfortunately, the compiler converts any reference, even an rvalue reference, to a pointer before I see it. So this appears as if it were of type AutoCheckCannotGC*
, which is not considered to be a GC pointer (it's a pointer to something considered to be a GC pointer, but the GC doesn't invalidate pointers to pointers).
This is coming up now because previously nothing would pass these token types by reference. I added logic to end an AutoCheckCannotGC
's live range if it was passed to std::move
to fix a tricky hazard (false alarm) that the changes in this bug created.
I am attempting a workaround: for types T like JS::AutoCheckCannotGC
, consider a T* to also be a GC pointer. (A T** will still not, though hopefully those won't exist in practice!) That will prevent passing pointers to these things, but you probably shouldn't be doing that anyway?
I'll see how it goes. Hacks like this often cause unexpected problems.
Reporter | ||
Comment 52•3 years ago
|
||
Current progress: I implemented the rvalue ref fix. That showed 58 hazards.
The first hazard was where ProcessTypedArraysFixed
called ApplyToTypedArrays
, which still used the lambda mechanism that passes in an AutoCheckCannotGC&&
. So since the actual processing GCed all over the place (which is why it was using *Fixed
in the first place), it reported a hazard. I looked at splitting out an ApplyToTypedArraysFixed
, but that seemed like a lot of duplicate template-heavy code, so instead I called nogc.reset()
to end the GC-forbidding scope.
That "reduced" the hazards from 58 -> 78. Typical.
It looks like my nogc.reset()
isn't getting picked up, probably because the codegen type of nogc
here is AutoCheckCannotGC*
and so there's an extra dereference wherever it is used, and I'm not handling it for the purpose of identifying the reset()
-like calls.
Fixing that did 78 -> 76. Progress, I guess?
The same hazard turned into an only slightly different hazard, because there's still an AutoCheckCannotGC
in ApplyToTypedArray::Apply
, which is called by ProcessTypedArraysFixed
. So it seems like I can't avoid adding in the whole separate Fixed path.
Implementing takes the hazard count 76 -> 476. :-(
Almost all of these hazards are from calling various WebGL functions, because they can trigger calls to JS::WarnUTF8()
which ends up calling through a function pointer to a warningReporter
. Both the signature of these functions as well as the comments seem to suggest that they really could create an Error object and therefore GC.
At this point, I guess I could say that in the error case, the movable data will no longer be used and so this doesn't matter. But I would want to verify that this is in fact the case first. If it is, then I see two options: (1) thread through an AutoCheckCannotGC&&
and reset()
it when the problem is discovered and we're ok with mangling that data; or (2) lie and claim that the warning reporter will never GC. (Or make it so it never does, but I don't know if that's doable. It sounds like these warnings can end up triggering errors? But maybe if that's only on OOM, then it'll be ok...?)
Posting this as a record of the current status.
Reporter | ||
Comment 53•3 years ago
|
||
I implemented a class AutoIgnoreGCForAnalysis
for the specific scenario here: you have a function that only GCs if it fails, and if it fails, callers will not use the invalidated data. That took the count from 476 → 476.
Side note: The reason why these counts often go up when I make "progress" is typically that you start with a hazard in function F, where you're holding an AutoCheckCannotGC
live across a GC. You decide that it's too coarse, and std::move
the nogc
into the callee, adding it as a parameter in the process. Now you have a new hazard in the callee wherever it GCs. Only the callee is a macro-generated templatized method with 400 separate instantiations, and so you get a separate hazard reported for each. It would be nice to have an indicator of how "deep" the hazards are, so that when fixing a top-level hazard exposes 100 lower down, you can tell that you're making progress...
The next problem here is that the hazard shifted to ClientWebGLContext::UniformData
being able to GC. It's the same basic problem—it can GC on an error condition, and we've created an AutoCheckCannotGC
token disconnected from the data that says that any GC is bad, regardless of whether the data is accessed. It's not the only reason, though; even in a successful run, GetActiveLinkData
is called and can GC. That appears to be through logging and is probably fake.
The natural solution at this point is to (1) fix the logging GC and (2) use AutoIgnoreGCForAnalysis
again. But the latter bothers me, because it would be really really easy to have it in slightly the wrong place and conceal a real hazard. What would be slightly better would be to have a specific mechanism for "will GC only on an error". And I think it could be done: create a new annotation JS_HAZ_GC_IF_ERROR, give the function a boolean return value, and return false on error. Then in the caller, holding a GC pointer live across a regular GC call will still be a hazard, but holding a GC pointer live across a JS_HAZ_GC_IF_ERROR call only matters if it returns true. It won't be easy, but it's doable. And it could replace the use of AutoIgnoreGCForAnalysis
.
(It also won't fix the problem here on its own. I'll still need to annotate away the logging GC, as well as a GC from a webgl::LinkResult
falling out of scope.)
I will first do a larger suppression to check whether there's something worse waiting behind it, and then decide whether it's the right thing to try next. That did 476 → 133. I was hoping for more, but that's good at least.
There are still many hazards in the webgl code. The next fix was for some control flow through std::forward
that the analysis couldn't follow. Fixing that did 133 → 124.
Reporter | ||
Comment 54•3 years ago
|
||
Reporter | ||
Comment 55•3 years ago
|
||
I managed to push through the information about what parameters are rvalue references to remove the heuristic guessing. That did 124 → 55.
The first of those looks like it may be a legitimate hazard in mozilla::dom::SourceBuffer::PrepareAppend
. I tentatively fixed that by using ProcessFixedData
.
Next is a hazard where std::forward
is again losing track of what's going on, as a result of the rvalue reference change. Re-fixing that, plus fixing a minor hazard resulting from an unused RefPtr, does 55 → 39. Now we're getting somewhere!
The next hazards are all of the same form: something calls ProcessData
and does some data validation, triggering an error if the validation fails. The error path can GC, which does not matter because in that case the input data will not be used. Those are fixable by manually doing nogc.reset()
in the error case, so that the non-error case will still prevent any GC from happening.
Fixing a handful of those, plus some minor code motion to move code out of the GC unsafe critical section, does 39 → 25.
That list still contains some large clusters of similar-looking hazards, so that number should continue to drop pretty quickly as I fix them.
Reporter | ||
Comment 56•3 years ago
|
||
Updated•3 years ago
|
Reporter | ||
Comment 57•3 years ago
|
||
Did a round of fixes, which did 25 → 20. I think I may need to talk to peterv to work through some of the remainders.
Reporter | ||
Comment 58•3 years ago
|
||
Now cleaning up and landing bits and pieces of the parts of my patch stack that are improvements in the capabilities of the hazard analysis. Until the rest of this stuff lands, they shouldn't introduce any more hazards or if they do, those should be cleaned up separately.
The ~shared_ptr
stuff is working. Current hazard count is 13. I think 2 of those are a problem with an annotation getting lost that I still need to look into. It's not related to the DOM ArrayBuffer changes here.
I'll upload my current bundle.
Reporter | ||
Comment 59•3 years ago
|
||
Reporter | ||
Comment 60•3 years ago
|
||
I took a stab at rebasing. There were some difficult changes, some seemingly holding volatile pointers across Promise steps, which isn't going to work. I made everything compile, but very possibly broke stuff in the process, and some of that stuff is going to need further changes to be GC safe.
Reporter | ||
Comment 61•3 years ago
|
||
Oops, used the wrong base for the bundle.
Updated•3 years ago
|
Reporter | ||
Comment 62•3 years ago
|
||
I reworked this to make ProcessFixedData
rewrite an inline ArrayBuffer
or a buffer-less view to move the data to out-of-line, then used the new ProcessFixedData
very liberally to resolve all remaining hazards. I have not done test runs for either performance or correctness, mostly because we haven't pushed this stuff to the try server yet for security reasons. This is where it would be nice to have a secure try server!
Time to start cleaning it up for review.
Updated•3 years ago
|
Reporter | ||
Comment 63•3 years ago
|
||
Sorry, delayed update here. I've been working on other things, and slowly landing the last of the hazard analysis changes needed.
All of the necessary hazard analysis changes have now landed.
What remains is to create a patch stack that modifies the added APIs, rebases the existing fixes here on top of current code, and updates them to the modified APIs.
I am working on an example stack. I will get it to the point where it compiles and has zero hazards, then get help in figuring out how to split it up to change each Gecko file once or at least in an orderly way. (Right now my stack has the original added APIs, Gecko changes to use them, incrementally modified APIs with incremental updates to the Gecko code, and then often a final fixup that modifies the Gecko code again.)
Note that I had all hazards handled at one point in time, but newer changes may introduce more. Also, some of my rebasing is suspect; I worried more about getting something that would compile than something that was correct! That's one place help will be needed.
Updated•2 years ago
|
Reporter | ||
Comment 64•2 years ago
|
||
Reporter | ||
Comment 65•2 years ago
|
||
I'm going on PTO for a bit, so I am uploading the current state again. This still has 3 hazards, all from the same source, which is an annoying case where we are erroring out and the error handling can trigger a GC. Threading through a consumable AutoCheckCannotGC&&
would fix it. So would more heavily using ProcessFixedData
. But I'm running out of time to try out a fix, and it's not that relevant in the grand scheme of things. I can fix it later.
This patch stack is in good enough state to look at to see how things are probably going to need to end up. It isn't reviewable, though. Too many things that change it one way, then another, and then settle on a final shape. For some of the more heavily-impacted files, I'm planning on extracting out all the change those those files and squashing them together if they don't cross over any of the significant API change points.
I am also going to start submitting some of the infrastructure pieces for review.
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Comment 66•2 years ago
|
||
Reporter | ||
Comment 67•2 years ago
|
||
Reporter | ||
Comment 68•2 years ago
|
||
Reporter | ||
Comment 69•2 years ago
|
||
Error reporting is a common source of GCs that produce false positives in the hazard analysis: you are working with some GC-volatile data (eg a GC pointer or a pointer to data stored inline in a GC thing) and something goes wrong, so you construct an error object to throw. The allocation can GC, even though the data will no longer be used after the GC. Passing the nogc token into the lambdas passed to ProcessData() allows telling the analysis "it's ok to GC now, I won't use the data anymore" by calling nogc.reset()
or passing it via std::move(nogc)
as an rvalue reference to some other function that consumes it.
Reporter | ||
Comment 70•2 years ago
|
||
Comment 71•2 years ago
|
||
Hi Peter/Steve, could you please share the status since comment 65? Thank you.
Updated•2 years ago
|
Reporter | ||
Comment 72•2 years ago
|
||
Reporter | ||
Comment 73•2 years ago
|
||
Updated•2 years ago
|
Reporter | ||
Comment 74•2 years ago
|
||
Updated•2 years ago
|
Reporter | ||
Comment 75•2 years ago
|
||
Reporter | ||
Comment 76•2 years ago
|
||
Reporter | ||
Comment 77•2 years ago
|
||
It is surprisingly difficult to prove that CurrentTime() will not GC.
Reporter | ||
Comment 78•2 years ago
|
||
Reporter | ||
Comment 79•2 years ago
|
||
Reporter | ||
Comment 80•2 years ago
|
||
Current status is that the stack is freshly rebased on top of mozilla-central, all hazards are resolved, and I have uploaded most of the current stack to phabricator. The remaining patches that I need to massage into shape are:
s22 31f0f1258a99 [AB.haz] [hazards] Annotate GL function pointers
s23 3fab89a9c141 [AB.haz] Incomplete attempt to fix random number generation hazards
s24 293082c4a134 [AB.haz] Rearrange Crypto.cpp code to avoid hazard
s25 9d229572c8f4 [AB.haz] Annotate away WebGL graphics hazards
s26 b5c72632cf81 [AB.haz] [hazards] Use same naming scheme and type for several annotations
s27 a9c0af0435a6 [AB.haz] Annotate a function pointer in profiler AddMarkerToBuffer()
s28 20a2de382aff [AB.haz] Fix hazards in compression streams
s29 2ec97a18e768 [AB.haz] PK11_* functions are low level eand will not GC
but some of my fixes are a little arduous, particularly in ClientWebGLContext.cpp. It is to the point where it would be useful to get feedback on whether we need to make any of this cleaner or faster (the stack makes pretty heavy use of ProcessedFixed*, which copy small data out of line). Tomorrow I will continue uploading the rest of the stack.
Reporter | ||
Comment 81•2 years ago
|
||
Assignee | ||
Comment 82•2 years ago
|
||
I am currently reviewing the new patches, and figuring out how to simplify the whole stack.
Assignee | ||
Comment 83•2 years ago
|
||
I am still reordering the stack and pulling patches into logical units.
Assignee | ||
Comment 85•2 years ago
|
||
Figuring how to make the new API work with the MOZ_CAN_RUN_SCRIPT static analysis.
Updated•2 years ago
|
Reporter | ||
Comment 86•2 years ago
|
||
Assignee | ||
Comment 87•2 years ago
|
||
Assignee | ||
Comment 88•2 years ago
|
||
Depends on D185067
Updated•2 years ago
|
Assignee | ||
Comment 89•2 years ago
|
||
Depends on D131301
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 90•2 years ago
|
||
Assignee | ||
Comment 91•2 years ago
|
||
Depends on D152498
Assignee | ||
Comment 92•2 years ago
|
||
Comment on attachment 9346749 [details]
Bug 1690111 - Improve nsIRandomGenerator APIs. r?mccr8!
Security Approval Request
- How easily could an exploit be constructed based on the patch?: The patches do point out the potential security issues. The old API makes writing vulnerable code much easier. Both me and Steve Fink have looked over the existing users of the old API and we don't think we're currently vulnerable, but I don't think we're completely sure. But I don't think it'd be very easy to write an exploit.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Yes
- Which older supported branches are affected by this flaw?: All
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: Yes
- If not, how different, hard to create, and risky will they be?: Should just apply to beta and I have patches for esr115.
- How likely is this patch to cause regressions; how much testing does it need?: This is a bit of a risky patch because it affects a lot of areas of the code and it changes out a complete API for another. We probably need to let it bake for a couple of releases before merging it to branches. As mentioned before, we don't think we're currently vulnerable, so the security risk of delaying shipping on esr should be fairly low.
- Is Android affected?: Yes
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Updated•2 years ago
|
Comment 93•2 years ago
|
||
Comment on attachment 9251039 [details]
Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views.
Consider the stack sec-approved to land and start the bake.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Comment 94•2 years ago
|
||
Comment 95•2 years ago
|
||
![]() |
||
Comment 96•2 years ago
|
||
Backed out for causing fetch related crashes:
https://hg.mozilla.org/integration/autoland/rev/3c19998bbddf59f5104ceda12187c275fbf63981
Failure logs:
https://treeherder.mozilla.org/logviewer?job_id=428691507&repo=autoland
https://treeherder.mozilla.org/logviewer?job_id=428689709&repo=autoland
https://treeherder.mozilla.org/logviewer?job_id=428691731&repo=autoland
![]() |
||
Comment 97•2 years ago
|
||
Comment 98•2 years ago
|
||
Comment 99•2 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/423cb05856e8
https://hg.mozilla.org/mozilla-central/rev/9497d4dcd737
https://hg.mozilla.org/mozilla-central/rev/719bfaded837
https://hg.mozilla.org/mozilla-central/rev/9589445bbb52
https://hg.mozilla.org/mozilla-central/rev/603e8067e68f
https://hg.mozilla.org/mozilla-central/rev/e39ae7aad8d9
https://hg.mozilla.org/mozilla-central/rev/88e9e7646e0d
https://hg.mozilla.org/mozilla-central/rev/338049100f37
https://hg.mozilla.org/mozilla-central/rev/8acc055a56d5
https://hg.mozilla.org/mozilla-central/rev/fab63d21d38b
https://hg.mozilla.org/mozilla-central/rev/fe6fd98dac14
https://hg.mozilla.org/mozilla-central/rev/b9bbd30d9797
Updated•2 years ago
|
Assignee | ||
Comment 101•2 years ago
•
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #100)
What's the plan for ESR115 approval here?
So far it stuck pretty well on trunk, but I think we should still wait a bit longer for regressions (in fact bug 1856672 just came in). Hopefully by 115.5esr we should have more confidence.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 102•2 years ago
|
||
:peterv do you still plan on uplifting this to esr for 115.5? if so, can you nominate it for uplift?
Assignee | ||
Comment 103•2 years ago
|
||
I think that given the crash volume and lack of progress in bug 1856672 I would rather postpone.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Comment 104•2 years ago
|
||
Peter, given that bug 1856672 doesn't seem to have progressed, should we more forward with backporting this to ESR? Or give up on it entirely?
Updated•2 years ago
|
Assignee | ||
Comment 105•2 years ago
|
||
Yeah, at this point it seems better to give up. If we do figure out a fix for bug 1856672 in time I'll come back to this and nominate, but it's looking unlikely.
Updated•2 years ago
|
Comment 106•1 years ago
|
||
Bulk-unhiding security bugs fixed in Firefox 119-121 (Fall 2023). Use "moo-doctrine-subsidy" to filter
Description
•