Closed Bug 1690111 Opened 5 years ago Closed 2 years ago

DOM bindings can generate code that holds an invalidatable typed array / array buffer live across a GC

Categories

(Core :: DOM: Bindings (WebIDL), defect)

defect

Tracking

()

RESOLVED FIXED
119 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox-esr102 --- wontfix
firefox-esr115 - wontfix
firefox94 --- wontfix
firefox95 --- wontfix
firefox96 --- wontfix
firefox97 --- wontfix
firefox98 --- wontfix
firefox109 --- wontfix
firefox110 --- wontfix
firefox111 --- wontfix
firefox112 --- wontfix
firefox113 --- wontfix
firefox114 --- wontfix
firefox115 --- wontfix
firefox116 --- wontfix
firefox117 --- wontfix
firefox118 --- wontfix
firefox119 + fixed

People

(Reporter: sfink, Assigned: peterv)

References

(Regressed 1 open bug)

Details

(Keywords: csectype-uaf, sec-high, Whiteboard: [adv-main119+r])

Attachments

(14 files, 31 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
198.32 KB, patch
Details | Diff | Splinter Review
56.53 KB, application/octet-stream
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Landing bug 1680607, which was primarily for static GC hazard checks but also introduces dynamic check, seems to have uncovered a class of hazards that the static analysis misses. From one of the backout messages:

Please also check:

assertion failure -> https://treeherder.mozilla.org/logviewer?job_id=327990826&repo=autoland&lineNumber=2305
crash - https://treeherder.mozilla.org/logviewer?job_id=327990898&repo=autoland&lineNumber=2085

These are two manifestations of the same problem: there is typed array or array buffer data on the stack, perhaps as an Optional<ArrayBufferViewOrArrayBuffer> or a RootedSpiderMonkeyInterface<Uint8Array>. It is either live across a GC call, or passed into something that GCs. The static analysis doesn't complain because it's rooted, or because you only pass in an address (of a GC pointer). But in either case, the AutoCheckCannotGC is active so if we end up doing something that could GC, we'll get the assertion.

I'm guessing there are more occurrences. These are legitimate UAF bugs, though they are difficult to trigger and rare in practice because you require either a nursery object or a compacting GC to break something. Not many nursery objects should be flowing through these paths, and compacting GCs usually run when there's nothing else on the stack.

Example 1 - mozilla::dom::IOUtils::Write is called with a pointer extracted from a RootedSpiderMonkeyInterface<Uint8Array>. The Write can definitely GC (it's creating a Promise), but It is possible that if you follow the data flow of the called function, the value will be dead by the time the GC happens; I haven't checked.

Example 2 - an Optional<ArrayBufferViewOrArrayBuffer> is used after a call that could GC: TextDecoderOptions::Init. The analysis doesn't see it because it's passing a pointer to a pointer. (The analysis does add this to a list of "unsafe references", which we could check.)

Looking at it more, this is looking like a pretty deep flaw in how bindings handle typed arrays & similar. When you pass one into a function generated by DOM bindings, the argument is given to the SpiderMonkeyInterface stuff that extracts out the data pointers. It has a bunch of machinery, all of which looks correct, for making sure the owning JSObject* is properly rooted. The problem here is that rooting isn't enough; because we've grabbed out a pointer, GC can invalidate even a rooted pointer. (Or rather, the rooted pointer itself will be updated properly, but any pointers into its data will not.)

Option 1: Since we do have stack holder objects here, I could apply the same fix that I used recently to only hang onto immovable data pointers (at a cost of using stack space for the largest possible movable data, and the performance cost of memcpying the data). Right now, that's feeling pretty unsatisfying, since it's paying a non-negligible cost all the time for something that is rarely needed, and it would apply to all WebIDL interfaces that manipulate array buffers and their views.

Option 2: Another alternative would be to eliminate TypedArray_base and friends entirely, and instead fetch the data pointers on demand. I fear that this would just make it harder to detect the underlying problem -- I'm only seeing this problem because I changed those classes to complain. If I eliminated that layer, then the problem might just recur in lots of handwritten, harder to analyze code.

Option 3: Yet another option: change the DOM bindings argument holders so that they still do all of the argument checking they do now, but they don't retrieve the problematic data pointers. Instead, when one of these argument values is needed, you'd "activate" the interface object to read out the data pointers, and then afterward Reset() it. (It might actually reset itself whenever it is passed as a parameter, but the analysis might require the explicit Reset to properly track what's going on.) In effect, this is similar to the previous alternative -- the interface objects would only be used for rooting, and for informing the analysis of in what ranges of code it is legal to be holding data pointers.

Option 4: The final option I can come up with is something that I tried when I first ran into TypedArray_base problems: have an API call that converts typed arrays to using immovable storage (if they aren't already). It has some of the memcpy drawbacks, though only once per object. The bigger problem I ran into is that TypedArray_base-using code isn't always able to handle or propagate allocation failure properly. Sometimes it's not even on the correct thread.

To enable the analysis to see problems like this, is it possible to make the methods that return data pointers return pointers to new wrapper types that the we can tag as invalidated by GC? Then we could do option 2 (fetch data pointers on demand) and the analysis would support us.

Group: core-security → dom-core-security

(In reply to Jon Coppeard (:jonco) from comment #2)

To enable the analysis to see problems like this, is it possible to make the methods that return data pointers return pointers to new wrapper types that the we can tag as invalidated by GC? Then we could do option 2 (fetch data pointers on demand) and the analysis would support us.

We've tried doing something like that, with string pointers I think. They get passed to library char* functions or stored temporarily in char* so much that it wasn't really practical. It might be more feasible here? Though the data gets used for a lot of different things.

But with bug 1680607 landed, TypedArray_base is pretty much exactly what you describe. It's a container for a data pointer that is explicitly retrieved by the caller (via ComputeState), and invalidated on a GC. That enabled enough analysis support to catch two problems. The additional problems (that this bug is for) were not caught statically because (1) rooting is seen as sufficient protection when it is not, in this case; and (2) the problematic pointer is passed in as a pointer to a pointer and so we only see it as an unsafe reference. We could go through and fix our unsafe references and drive them to zero unannotated ones. It's not going to be easy, though.

That would fix the problem detection side. Though with bug 1680607, we're not doing too badly with the problem detection, it's just that it relies on some dynamic checks instead of reporting everything statically.

The harder side is fixing these for real. I guess I'd like to answer that part before figuring out what the best detection mechanism is.

One of those two statically-caught problems was extracting a pointer and using it offthread. (In fact, it was doing a limited event loop spin while the pointer was live.) That's only going to work if the pointer really cannot be invalidated (which was the fix I used.) I was willing to take the stack space and memcpy hit at the point of pointer extraction for that one, but I'm less enthusiastic about doing that for all WebIDL interfaces that use any of these types.

I'll add an Option 5: Add an interface for converting a typed array to point to stable memory, by un-inlining objects with inline data. Additionally, have a marker to prevent compaction of that object (or arena, or AllocKind, or whatever) while it is in scope. I attempted the un-inlining approach earlier, but I was trying to do it at a place that couldn't really handle allocation failure. In the DOM bindings code, at least, I think getting a cx and handling errors would be fine.

I keep trying to come up with an approach where we preallocate stack space, but only do the memcpy if a GC happens. But that requires tracking and updating the pointers. TypedArray_base or a variant StableTypedArray_base could be used for that -- but in current practice, they're not. The pointers are extracted early and passed into and stored in various things. (Not as many as I had been expecting, given the low number of static failures with bug 1860607, but the dynamic failures may point to a bunch more.)

I am starting to think that not storing the pointers in TypedArray_base in the first place would be better. From looking at a number of places where we call ComputeState, we actually then immediately copy the data ([1], [2], and even [3]). Maybe we should have a better API for typed arrays for when we just want to copy the data anyway. The only worry is that people just blindly use that potentially more expensive API when they don't need it (I think we usually copy-paste code like that from existing callsites :-)).

We could then also add a separate stack-only class (returned from ComputeState?) just for the data pointers. It could narrow the scope where these data pointers are assumed alive and correct?

[1] https://searchfox.org/mozilla-central/source/toolkit/components/extensions/webrequest/StreamFilter.cpp#131
[2] https://searchfox.org/mozilla-central/source/layout/style/FontFace.cpp#57
[3] https://searchfox.org/mozilla-central/source/dom/websocket/WebSocket.cpp#2290

(In reply to Steve Fink [:sfink] [:s:] from comment #0)

Example 1 - mozilla::dom::IOUtils::Write is called with a pointer extracted from a RootedSpiderMonkeyInterface<Uint8Array>. The Write can definitely GC (it's creating a Promise), but It is possible that if you follow the data flow of the called function, the value will be dead by the time the GC happens; I haven't checked.

This is an example of immediate copying after ComputeState, so might be able to use that proposed API for copying.

I started looking into a copying API. It might still require some sort of "pinning" interface, but this time because of new features: it looks like tc39 will probably add ResizeableArrayBuffers that can change views' lengths. That means, for something like your [1], that we need to know the length is not going to change between retrieving the length and copying the data. In the case of [1]:

  aArray.ComputeState();
  if (!aData.SetLength(aArray.Length(), fallible)) {
    aRv.Throw(NS_ERROR_OUT_OF_MEMORY);
    return false;
  }
  memcpy(aData.Elements(), aArray.Data(), aArray.Length());

The length is retrieved in ComputeState, then used to set up the destination, and then the data is copied. I guess we could track and and let it fail? So:

  size_t length = aArray.Length(); // Internally calls ComputeState, or just part of it
  if (!aData.SetLength(length, fallible)) {
    aRv.Throw(NS_ERROR_OUT_OF_MEMORY);
    return false;
  }
  if (!aArray.CopyTo(aData.Elements(), aArray.Data(), length)) {
    aRv.Throw(NS_ERROR_INVALID_ARG); // Or some more specific error
    return false;
  }

where the idea is that if length changes between the Length() and CopyTo() calls, it would return failure. Or perhaps only if the length increases, though honestly any change seems like a bug waiting to happen.

A stricter alternative would be to disallow typed arrays that are auto-tracking the length of an underlying ResizeableArrayBuffer, but that seems a bit weird.

Or if we were to lean on the hazard analysis more, we could teach it that accessing Length() changes TypedArray_base from something that needs to be rooted across a GC, into something that in unrootable and can never be live across a GC, and then "reactivate" it after CopyTo (so back to just needing to be rooted). CopyTo would need to discard any computed state as well.

This is piling multiple policies into a single type, so perhaps it would still be better to separate out TypedArray_base from an unrootable TypedArrayData type, as suggested by jonco in comment 2 and peterv in comment 4.

Note that for a long time, I was resisting relying on data flow in the hazard analysis, but more recently I've been adding more and more of it. So things like changing the nature of TypedArray_base is not too far-fetched anymore.

Oh, duh, the Length() thing I described in comment 6 just means that nsTArray::SetLength needs to be understood as not GCing. (It ends up calling things through function pointers, but it's certainly annotatable.)

I'm skimming through some of the uses.

  • pass in to be copied by non-GCFunction: ChromUtils.cpp
  • held across scary do_GetService, then copied: Crypto.cpp
  • copied in immediately called function: DOMMatrix.cpp
  • passed down into all kinds of crazy scary stuff; copying small data would be fine: DOMParser
  • immediate copy: nsDOMDataChannel
  • copy, but not very immediate: CanvasRenderingContext2D.cpp
    • enough functions are called in between that it will require work to teach the analysis that none of them can GC
  • held across scary CreateWrappingDataSourceSurface that can release a RefPtr: WebGLTextureUpload.cpp
    • it's hard to prove that ~RefPtr won't GC
  • held across ~nsCOMPtr, similar to ~RefPtr: BodyExtractor.cpp

It does look like the majority of these boil down to pretty quickly copying the data. Scanning through them also makes me realize that the hazard analysis really ought to be checking this somehow. That should happen in conjunction with changing the interface to make it more possible to use safely.

Back to looking at this. I tweaked the hazard analysis so that with a function foo(mozilla::dom::Uint8Array& aData) that accepts a reference to a TypedArray_base subclass, that variable's type is considered to hold unrooted GC pointers. That flags any situation where a GC can happen while the parameter is still live, regardless of whether ComputeState has been called. (Actually, if it really is one of these types and not a RootedSpidermonkeyInterface type, then these are real hazards.) Technically, this is overzealous -- the analysis can't distinguish between pointers and references, and it's possible that some of these are doing a Handle-like thing by passing a pointer to a rooted typedarray. But probably not.

That flagged 73 hazards, 34 of them in TestJSImplGenBinding.cpp, which is an autogenerated test program. A number of these are bogus because they require more annotations to ignore things like PLDHashTable function pointers that aren't going to GC. The remainder that seems definite:

  • WebCryptoTask.cpp:467 : the parameter const CryptoOperationData& aData crosses over the call to Init which calls JS_GetProperty, and then goes into the call SetData(aData). There's definitely a classic rooting hazard here with the object pointer. I'm not sure when the ComputeState happens.
  • Actually, a bunch of the hazards are basically the same thing in different locations in the same file.
  • Promise.h:322 : standard rooting hazard. aArgument can be invalidated during the AutoEntryScript constructor, and is then passed into ToJSValue.
  • dom/canvas/ClientWebGLContext.cpp:3103 : GetBufferSubData parameter dstData crosses ValidateNonNegative which can GC before being used. Standard rooting hazard. Given the parameter name, the data is probably overwritten, but that would still be writing after freeing.
  • the same thing later in the file, in ReadPixels.

Not too bad of a list. Note that all of the "classic rooting hazards" might also be problems with interior pointers, so just fixing the rooting problem may not be sufficient.

Assigning to sfink since he's looking into this.

Assignee: nobody → sphink

Steven, could you please have sfink finish the work for this sec-high bug?

Flags: needinfo?(sdetar)

I will talk with sfink off line about this bug

Flags: needinfo?(sdetar)

Steve, could you please find time to finish the work for this sec-high bug?

Flags: needinfo?(sphink)

Work is in progress, but it's going to be gradual and probably require some fairly extensive changes. The basic issue is that the DOM bindings typed array handling is doing too many different things. I am changing the JSAPI for typed arrays to move some of that work out of the DOM code, at which time we can make smaller more targeted classes that will cover the different scenarios currently being (mis)handled by the all-encompassing classes currently being used.

Depends on: 1720422
Flags: needinfo?(sphink)

Set to S2 as this is a sec-high and could possibly hit UAF.

Severity: -- → S2

Peter said it requires new APIs to fix this properly. He's going to schedule a meeting with Steve to discuss.

I talked with peterv and started going through uses of ComputeState to try removing them to see what new APIs we need. I'll do a recap of my findings so far.

The set of hazards we're trying to resolve:

  • the standard GC rooting hazards: something holds onto a dom::TypedArray_base subclass, we do a GC, the JSObject is deleted.
  • the object is rooted (eg via RootedSpiderMonkeyInterface), but the data is stored inline in the JSObject and gets moved during a GC
  • the data is moved while the object is not rooted
  • whether the object is rooted or not, the length changes as a result of JS code running in between when the length is retrieved and used

That last one can only happen if you detach the underlying ArrayBuffer right now, but we will be adding resizable ArrayBuffers which will introduce another way for it to happen. I haven't kept up with the latest spec proposals, but I think the length can change to any value: zero, smaller but nonzero, and larger.

Inline data will always be fairly small. Right now, it looks to be at most 96 bytes on 64-bit.

The set of facilities I'm thinking of having to handle this code are:

  • A FixedData() getter that returns data guaranteed not to move during a GC, by requiring the caller to pass in a small buffer. If the data is inline, the data will be copied into the buffer and a pointer to the buffer returned. Otherwise, it will return the actual malloc pointer.
  • data and length accessors that require the caller to pass in a nogc parameter. This is not ironclad; the caller could still hang onto the data pointer or length past the end of the lifetime of that parameter. But at least you have to work to make things unsafe, and it enables the static analysis to double-check whether any GC is possible within the nogc scope.
  • a CopyData() accessor that copies the data without exposing any pointers at all. Most of the time, that's all that's happening anyway, though the copy is sometimes buried a couple of layers down.

One limitation of the FixedData() approach is that it does not require the object to be rooted. You could call that method on an unrooted structure, do a GC, and have the malloc pointer freed. I'm still working out if there's a way to require rooting for it.

I will post some patches with my current revision of the APIs needed to remove ComputeState, together with a couple of patches where I use them in a small handful of places, and peterv will look into going through the rest of the code. When I finish up a few other things, I hope to join in the fun.

Once ComputeState is gone, then we can mark dom::TypedArray_base as a GC pointer, and the hazard analysis should be able to double-check our work.

Attached file Bug 1690111 - (unused) getByteLengthAndData (obsolete) —
Status: NEW → ASSIGNED
Attachment #9251036 - Attachment description: Bug 1690111 - Remove dom::TypedArray::ComputeState from Crypo.cpp → Bug 1690111 - Remove dom::TypedArray::ComputeState from Crypto.cpp
Attachment #9251040 - Attachment description: Bug 1690111 - Expose a FixedData() accessor on RootedSpiderMonkeyInterface → Bug 1690111 - Move FixedData() accessor from TypedArray_base to only be on RootedSpiderMonkeyInterface

Another wrinkle with the FixedData approach is that because it does not require passing in a nogc token, that it is then possible to run JS code while holding the pointer. And that JS code could detach the ArrayBuffer or view, making the pointer invalid (or equivalently, making the valid length be zero). One of the patches in this series adds a FreezeLength() API that returns the length but also makes anything that might change it fail until UnfreezeLength() is called. It's a pretty heavy-handed approach, and it causes the failure to happen when you try to detach rather than when you try to use the length, but it's safer. I'm still questioning whether it's a good idea.

Attachment #9251040 - Attachment description: Bug 1690111 - Move FixedData() accessor from TypedArray_base to only be on RootedSpiderMonkeyInterface → Bug 1690111 - Replace FixedData() accessor with AcquireFixedData()/ReleaseFixedData() pair, only on RootedSpiderMonkeyInterface

Note that at least one of the fixes here will produce a false alarm hazard, because the analysis thinks a forgotten refptr could GC in its destructor. I'm not sure if I'm supposed to be dependencies on insecure bugs from secure bugs, but the fix is bug 1746090.

Status update - Steve has created some new APIs that we're trying out and did some example fixes using them. The next step is for Peter to see how far that could get us.

Status update - Steve and Peter made good progress on this. New APIs Steve added are feasible. Peter is applying the new APIs to the surrounding code but still a lot to do.

Depends on: 1746090

Assigned to Peter as he has been working on this.

Assignee: sphink → peterv

Status update - Peter already converted many callers into new APIs, still have 9-10 to be converted.

Updates - Peter is still working on this. In his WIP, he is able to convert all the use cases as well as new ones he discovered with the new APIs he implemented.

Updates - Peter and Steve had a meeting to review/discuss WIPs two weeks ago. Peter is currently addressing the comments.

Updates - still working on this. Pushed the latest patches to try, and has been investigating failures.

Peter - would you please give a brief update on this bug? Thank you.

Flags: needinfo?(peterv)

I have been chasing a weird issue, which I finally found the cause of (accessing the data of a zero-length span tries to access an invalid pointer). That sees to have fixed the last failing test I was seeing.
We found one problematic DOM API that looks vulnerable, but luckily it's a chrome-only API so we don't think we need to worry about it (and the patches with the new API will fix it).

Flags: needinfo?(peterv)
Attachment #9251039 - Attachment description: Bug 1690111 - Implement the ability to "freeze" length changes for ArrayBuffers and their views. → WIP: Bug 1690111 - Implement the ability to freeze length changes for ArrayBuffers and their views.
Attachment #9302709 - Attachment description: WIP: Bug 1690111 - Make JS::FreezeArrayBufferOrViewLength works with SharedArrayBuffer. → Bug 1690111 - Make JS::FreezeArrayBufferOrViewLength works with SharedArrayBuffer. r?sfink!
Attachment #9286595 - Attachment description: WIP: Bug 1690111 - New TypedArray APIs. → Bug 1690111 - New TypedArray APIs. r?sfink!,farre!
Attachment #9286596 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for appending data to a container. → Bug 1690111 - Use new TypedArray APIs for appending data to a container. r?farre!
Attachment #9286597 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for copying data to a container. → Bug 1690111 - Use new TypedArray APIs for copying data to a container. r?farre!
Attachment #9286598 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for creating a container with a copy of the data. → Bug 1690111 - Use new TypedArray APIs for creating a container with a copy of the data. r?farre!
Attachment #9286599 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for processing data. → Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre!
Attached patch Diff of generated binding code — — Splinter Review

Updates - patches in preview. Peter and Steve were running more static analysis, some tweaks in patches on the way.

Hi Steve, my understanding was that you were looking at static analysis issues before the holiday break. Any updates since then? Thanks.

Flags: needinfo?(sfink)

Updates per the chat with Steve - before the holiday, Steve looked into the static analysis warnings caused by Peter's patches. Steve changed the API slightly to make the region smaller where the analysis would look for problems (calls to functions that could GC) that reduced the number of warnings down to zero. However, the "zero warning" result is because of a bug in the analysis. So the next step here is that Steve improves the analysis so that it is able to check the smaller regions correctly.

Flags: needinfo?(sfink)

Specifically, I am now passing in an AutoCheckCannotGC&& parameter to the lambdas passed to eg ProcessTypedArrays. Unfortunately, the compiler converts any reference, even an rvalue reference, to a pointer before I see it. So this appears as if it were of type AutoCheckCannotGC*, which is not considered to be a GC pointer (it's a pointer to something considered to be a GC pointer, but the GC doesn't invalidate pointers to pointers).

This is coming up now because previously nothing would pass these token types by reference. I added logic to end an AutoCheckCannotGC's live range if it was passed to std::move to fix a tricky hazard (false alarm) that the changes in this bug created.

I am attempting a workaround: for types T like JS::AutoCheckCannotGC, consider a T* to also be a GC pointer. (A T** will still not, though hopefully those won't exist in practice!) That will prevent passing pointers to these things, but you probably shouldn't be doing that anyway?

I'll see how it goes. Hacks like this often cause unexpected problems.

Current progress: I implemented the rvalue ref fix. That showed 58 hazards.

The first hazard was where ProcessTypedArraysFixed called ApplyToTypedArrays, which still used the lambda mechanism that passes in an AutoCheckCannotGC&&. So since the actual processing GCed all over the place (which is why it was using *Fixed in the first place), it reported a hazard. I looked at splitting out an ApplyToTypedArraysFixed, but that seemed like a lot of duplicate template-heavy code, so instead I called nogc.reset() to end the GC-forbidding scope.

That "reduced" the hazards from 58 -> 78. Typical.

It looks like my nogc.reset() isn't getting picked up, probably because the codegen type of nogc here is AutoCheckCannotGC* and so there's an extra dereference wherever it is used, and I'm not handling it for the purpose of identifying the reset()-like calls.

Fixing that did 78 -> 76. Progress, I guess?

The same hazard turned into an only slightly different hazard, because there's still an AutoCheckCannotGC in ApplyToTypedArray::Apply, which is called by ProcessTypedArraysFixed. So it seems like I can't avoid adding in the whole separate Fixed path.

Implementing takes the hazard count 76 -> 476. :-(

Almost all of these hazards are from calling various WebGL functions, because they can trigger calls to JS::WarnUTF8() which ends up calling through a function pointer to a warningReporter. Both the signature of these functions as well as the comments seem to suggest that they really could create an Error object and therefore GC.

At this point, I guess I could say that in the error case, the movable data will no longer be used and so this doesn't matter. But I would want to verify that this is in fact the case first. If it is, then I see two options: (1) thread through an AutoCheckCannotGC&& and reset() it when the problem is discovered and we're ok with mangling that data; or (2) lie and claim that the warning reporter will never GC. (Or make it so it never does, but I don't know if that's doable. It sounds like these warnings can end up triggering errors? But maybe if that's only on OOM, then it'll be ok...?)

Posting this as a record of the current status.

I implemented a class AutoIgnoreGCForAnalysis for the specific scenario here: you have a function that only GCs if it fails, and if it fails, callers will not use the invalidated data. That took the count from 476 → 476.

Side note: The reason why these counts often go up when I make "progress" is typically that you start with a hazard in function F, where you're holding an AutoCheckCannotGC live across a GC. You decide that it's too coarse, and std::move the nogc into the callee, adding it as a parameter in the process. Now you have a new hazard in the callee wherever it GCs. Only the callee is a macro-generated templatized method with 400 separate instantiations, and so you get a separate hazard reported for each. It would be nice to have an indicator of how "deep" the hazards are, so that when fixing a top-level hazard exposes 100 lower down, you can tell that you're making progress...

The next problem here is that the hazard shifted to ClientWebGLContext::UniformData being able to GC. It's the same basic problem—it can GC on an error condition, and we've created an AutoCheckCannotGC token disconnected from the data that says that any GC is bad, regardless of whether the data is accessed. It's not the only reason, though; even in a successful run, GetActiveLinkData is called and can GC. That appears to be through logging and is probably fake.

The natural solution at this point is to (1) fix the logging GC and (2) use AutoIgnoreGCForAnalysis again. But the latter bothers me, because it would be really really easy to have it in slightly the wrong place and conceal a real hazard. What would be slightly better would be to have a specific mechanism for "will GC only on an error". And I think it could be done: create a new annotation JS_HAZ_GC_IF_ERROR, give the function a boolean return value, and return false on error. Then in the caller, holding a GC pointer live across a regular GC call will still be a hazard, but holding a GC pointer live across a JS_HAZ_GC_IF_ERROR call only matters if it returns true. It won't be easy, but it's doable. And it could replace the use of AutoIgnoreGCForAnalysis.

(It also won't fix the problem here on its own. I'll still need to annotate away the logging GC, as well as a GC from a webgl::LinkResult falling out of scope.)

I will first do a larger suppression to check whether there's something worse waiting behind it, and then decide whether it's the right thing to try next. That did 476 → 133. I was hoping for more, but that's good at least.

There are still many hazards in the webgl code. The next fix was for some control flow through std::forward that the analysis couldn't follow. Fixing that did 133 → 124.

Attached file sfink work bundle as of 2023-01-30 (obsolete) —

I managed to push through the information about what parameters are rvalue references to remove the heuristic guessing. That did 124 → 55.

The first of those looks like it may be a legitimate hazard in mozilla::dom::SourceBuffer::PrepareAppend. I tentatively fixed that by using ProcessFixedData.

Next is a hazard where std::forward is again losing track of what's going on, as a result of the rvalue reference change. Re-fixing that, plus fixing a minor hazard resulting from an unused RefPtr, does 55 → 39. Now we're getting somewhere!

The next hazards are all of the same form: something calls ProcessData and does some data validation, triggering an error if the validation fails. The error path can GC, which does not matter because in that case the input data will not be used. Those are fixable by manually doing nogc.reset() in the error case, so that the non-error case will still prevent any GC from happening.

Fixing a handful of those, plus some minor code motion to move code out of the GC unsafe critical section, does 39 → 25.

That list still contains some large clusters of similar-looking hazards, so that number should continue to drop pretty quickly as I fix them.

Attached file sfink work bundle as of 2023-02-02 (obsolete) —
Attachment #9314971 - Attachment is obsolete: true

Did a round of fixes, which did 25 → 20. I think I may need to talk to peterv to work through some of the remainders.

Now cleaning up and landing bits and pieces of the parts of my patch stack that are improvements in the capabilities of the hazard analysis. Until the rest of this stuff lands, they shouldn't introduce any more hazards or if they do, those should be cleaned up separately.

The ~shared_ptr stuff is working. Current hazard count is 13. I think 2 of those are a problem with an annotation getting lost that I still need to look into. It's not related to the DOM ArrayBuffer changes here.

I'll upload my current bundle.

Attached file sfink work bundle as of 2023-02-17 (obsolete) —
Attachment #9315767 - Attachment is obsolete: true
Attached file trial of a rebased stack, 2023-02-28 (obsolete) —

I took a stab at rebasing. There were some difficult changes, some seemingly holding volatile pointers across Promise steps, which isn't going to work. I made everything compile, but very possibly broke stuff in the process, and some of that stuff is going to need further changes to be GC safe.

Attached file trial of a rebased stack, 2023-02-28 (obsolete) —

Oops, used the wrong base for the bundle.

Attachment #9320395 - Attachment is obsolete: true

I reworked this to make ProcessFixedData rewrite an inline ArrayBuffer or a buffer-less view to move the data to out-of-line, then used the new ProcessFixedData very liberally to resolve all remaining hazards. I have not done test runs for either performance or correctness, mostly because we haven't pushed this stuff to the try server yet for security reasons. This is where it would be nice to have a secure try server!

Time to start cleaning it up for review.

Sorry, delayed update here. I've been working on other things, and slowly landing the last of the hazard analysis changes needed.

All of the necessary hazard analysis changes have now landed.

What remains is to create a patch stack that modifies the added APIs, rebases the existing fixes here on top of current code, and updates them to the modified APIs.

I am working on an example stack. I will get it to the point where it compiles and has zero hazards, then get help in figuring out how to split it up to change each Gecko file once or at least in an orderly way. (Right now my stack has the original added APIs, Gecko changes to use them, incrementally modified APIs with incremental updates to the Gecko code, and then often a final fixup that modifies the Gecko code again.)

Note that I had all hazards handled at one point in time, but newer changes may introduce more. Also, some of my rebasing is suspect; I worried more about getting something that would compile than something that was correct! That's one place help will be needed.

Attached file 2023-04-26.bundle (obsolete) —
Attachment #9318510 - Attachment is obsolete: true
Attachment #9320403 - Attachment is obsolete: true

I'm going on PTO for a bit, so I am uploading the current state again. This still has 3 hazards, all from the same source, which is an annoying case where we are erroring out and the error handling can trigger a GC. Threading through a consumable AutoCheckCannotGC&& would fix it. So would more heavily using ProcessFixedData. But I'm running out of time to try out a fix, and it's not that relevant in the grand scheme of things. I can fix it later.

This patch stack is in good enough state to look at to see how things are probably going to need to end up. It isn't reviewable, though. Too many things that change it one way, then another, and then settle on a final shape. For some of the more heavily-impacted files, I'm planning on extracting out all the change those those files and squashing them together if they don't cross over any of the significant API change points.

I am also going to start submitting some of the infrastructure pieces for review.

Assignee: peterv → sphink
Attached file Bug 1690111 - Remove nogc from Convert (obsolete) —

Error reporting is a common source of GCs that produce false positives in the hazard analysis: you are working with some GC-volatile data (eg a GC pointer or a pointer to data stored inline in a GC thing) and something goes wrong, so you construct an error object to throw. The allocation can GC, even though the data will no longer be used after the GC. Passing the nogc token into the lambdas passed to ProcessData() allows telling the analysis "it's ok to GC now, I won't use the data anymore" by calling nogc.reset() or passing it via std::move(nogc) as an rvalue reference to some other function that consumes it.

Hi Peter/Steve, could you please share the status since comment 65? Thank you.

Flags: needinfo?(sfink)
Flags: needinfo?(peterv)
Attachment #9251039 - Attachment description: WIP: Bug 1690111 - Implement the ability to freeze length changes for ArrayBuffers and their views. → Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views.
Attachment #9337565 - Attachment description: Bug 1690111 - Reset the nogc restriction for error conditions → Bug 1690111 - [DOMMatrix] Reset the nogc restriction for error conditions
Attached file Bug 1690111 - CurrentTime() is complicated (obsolete) —

It is surprisingly difficult to prove that CurrentTime() will not GC.

Attached file Bug 1690111 - Annotate shmem munging as non-GC (obsolete) —

Current status is that the stack is freshly rebased on top of mozilla-central, all hazards are resolved, and I have uploaded most of the current stack to phabricator. The remaining patches that I need to massage into shape are:

s22 31f0f1258a99 [AB.haz] [hazards] Annotate GL function pointers   
s23 3fab89a9c141 [AB.haz] Incomplete attempt to fix random number generation hazards   
s24 293082c4a134 [AB.haz] Rearrange Crypto.cpp code to avoid hazard   
s25 9d229572c8f4 [AB.haz] Annotate away WebGL graphics hazards   
s26 b5c72632cf81 [AB.haz] [hazards] Use same naming scheme and type for several annotations   
s27 a9c0af0435a6 [AB.haz] Annotate a function pointer in profiler AddMarkerToBuffer()   
s28 20a2de382aff [AB.haz] Fix hazards in compression streams   
s29 2ec97a18e768 [AB.haz] PK11_* functions are low level eand will not GC   

but some of my fixes are a little arduous, particularly in ClientWebGLContext.cpp. It is to the point where it would be useful to get feedback on whether we need to make any of this cleaner or faster (the stack makes pretty heavy use of ProcessedFixed*, which copy small data out of line). Tomorrow I will continue uploading the rest of the stack.

Flags: needinfo?(sfink)
Attached file TA.haz-2023-Jun-6.bundle —
Attachment #9330505 - Attachment is obsolete: true

I am currently reviewing the new patches, and figuring out how to simplify the whole stack.

Flags: needinfo?(peterv)

I am still reordering the stack and pulling patches into logical units.

Started splitting off some patches.

Depends on: 1841325

Figuring how to make the new API work with the MOZ_CAN_RUN_SCRIPT static analysis.

Attachment #9251039 - Attachment description: Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views. → WIP: Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views.

Depends on D131301

Attachment #9346741 - Attachment description: Bug 1690111 - Implement JS::EnsureNonInlineData → WIP: Bug 1690111 - Implement JS::EnsureNonInlineData. r?jonco
Attachment #9286595 - Attachment description: Bug 1690111 - New TypedArray APIs. r?sfink!,farre! → WIP: Bug 1690111 - New TypedArray APIs. r?sfink!,farre!
Attachment #9286596 - Attachment description: Bug 1690111 - Use new TypedArray APIs for appending data to a container. r?farre! → WIP: Bug 1690111 - Use new TypedArray APIs for appending data to a container. r?farre!
Attachment #9286597 - Attachment description: Bug 1690111 - Use new TypedArray APIs for copying data to a container. r?farre! → WIP: Bug 1690111 - Use new TypedArray APIs for copying data to a container. r?farre!
Attachment #9286598 - Attachment description: Bug 1690111 - Use new TypedArray APIs for creating a container with a copy of the data. r?farre! → WIP: Bug 1690111 - Use new TypedArray APIs for creating a container with a copy of the data. r?farre!
Attachment #9286599 - Attachment description: Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre! → WIP: Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre!
Attachment #9330512 - Attachment is obsolete: true
Attachment #9330511 - Attachment is obsolete: true
Attachment #9330510 - Attachment is obsolete: true
Attachment #9330509 - Attachment is obsolete: true
Attachment #9337584 - Attachment is obsolete: true
Attachment #9337583 - Attachment is obsolete: true
Attachment #9337582 - Attachment is obsolete: true
Attachment #9337581 - Attachment is obsolete: true
Attachment #9337578 - Attachment is obsolete: true
Attachment #9337565 - Attachment is obsolete: true
Attachment #9337554 - Attachment is obsolete: true
Attachment #9251035 - Attachment is obsolete: true
Attachment #9251034 - Attachment is obsolete: true
Attachment #9251033 - Attachment is obsolete: true
Attachment #9330508 - Attachment is obsolete: true
Attachment #9307068 - Attachment is obsolete: true
Attachment #9251042 - Attachment is obsolete: true
Attachment #9251270 - Attachment is obsolete: true
Attachment #9251041 - Attachment is obsolete: true
Attachment #9251040 - Attachment is obsolete: true
Attachment #9251038 - Attachment is obsolete: true
Attachment #9251037 - Attachment is obsolete: true
Attachment #9251036 - Attachment is obsolete: true
Attachment #9337555 - Attachment is obsolete: true
Attachment #9302709 - Attachment is obsolete: true
Attachment #9346751 - Attachment description: WIP: Bug 1690111 - Add JS::AutoCheckCannotGC::reset. r?sfink! → WIP: Bug 1690111 - Add JS::AutoCheckCannotGC::reset. r?peterv
Attachment #9346741 - Attachment description: WIP: Bug 1690111 - Implement JS::EnsureNonInlineData. r?jonco → WIP: Bug 1690111 - Implement JS::EnsureNonInlineArrayBufferOrView. r?jonco
Attachment #9346750 - Attachment description: WIP: Bug 1690111 - Make CloneBuffer use JSAPI directly. r?padenot! → Bug 1690111 - Make CloneBuffer use JSAPI directly. r?padenot!
Attachment #9346749 - Attachment description: WIP: Bug 1690111 - Improve nsIRandomGenerator APIs. → Bug 1690111 - Improve nsIRandomGenerator APIs. r?mccr8!
Attachment #9286595 - Attachment description: WIP: Bug 1690111 - New TypedArray APIs. r?sfink!,farre! → Bug 1690111 - New TypedArray APIs. r?sfink!,farre!
Attachment #9286596 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for appending data to a container. r?farre! → Bug 1690111 - Use new TypedArray APIs for appending data to a container. r?farre!
Attachment #9286597 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for copying data to a container. r?farre! → Bug 1690111 - Use new TypedArray APIs for copying data to a container. r?farre!
Attachment #9286598 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for creating a container with a copy of the data. r?farre! → Bug 1690111 - Use new TypedArray APIs for creating a container with a copy of the data. r?farre!
Attachment #9286599 - Attachment description: WIP: Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre! → Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre!
Attachment #9251039 - Attachment description: WIP: Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views. → Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views.
Attachment #9346751 - Attachment description: WIP: Bug 1690111 - Add JS::AutoCheckCannotGC::reset. r?peterv → Bug 1690111 - Add JS::AutoCheckCannotGC::reset. r?peterv
Attachment #9346741 - Attachment description: WIP: Bug 1690111 - Implement JS::EnsureNonInlineArrayBufferOrView. r?jonco → Bug 1690111 - Implement JS::EnsureNonInlineArrayBufferOrView. r?jonco
Attachment #9286599 - Attachment description: Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre! → Bug 1690111 - Use new TypedArray APIs for processing data.
Attachment #9286599 - Attachment description: Bug 1690111 - Use new TypedArray APIs for processing data. → Bug 1690111 - Use new TypedArray APIs for processing data. r?sfink!,farre!

Depends on D152498

Comment on attachment 9346749 [details]
Bug 1690111 - Improve nsIRandomGenerator APIs. r?mccr8!

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: The patches do point out the potential security issues. The old API makes writing vulnerable code much easier. Both me and Steve Fink have looked over the existing users of the old API and we don't think we're currently vulnerable, but I don't think we're completely sure. But I don't think it'd be very easy to write an exploit.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Yes
  • Which older supported branches are affected by this flaw?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: Should just apply to beta and I have patches for esr115.
  • How likely is this patch to cause regressions; how much testing does it need?: This is a bit of a risky patch because it affects a lot of areas of the code and it changes out a complete API for another. We probably need to let it bake for a couple of releases before merging it to branches. As mentioned before, we don't think we're currently vulnerable, so the security risk of delaying shipping on esr should be fairly low.
  • Is Android affected?: Yes
Attachment #9346749 - Flags: sec-approval?
Attachment #9251039 - Flags: sec-approval?
Attachment #9286595 - Flags: sec-approval?
Attachment #9286596 - Flags: sec-approval?
Attachment #9286597 - Flags: sec-approval?
Attachment #9286598 - Flags: sec-approval?
Attachment #9286599 - Flags: sec-approval?
Attachment #9306012 - Flags: sec-approval?
Attachment #9346741 - Flags: sec-approval?
Attachment #9346750 - Flags: sec-approval?
Attachment #9346751 - Flags: sec-approval?
Attachment #9350875 - Flags: sec-approval?
Attachment #9350884 - Flags: sec-approval?
Attachment #9306012 - Flags: sec-approval?

Comment on attachment 9251039 [details]
Bug 1690111 - Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views.

Consider the stack sec-approved to land and start the bake.

Attachment #9251039 - Flags: sec-approval? → sec-approval+
Attachment #9286595 - Flags: sec-approval? → sec-approval+
Attachment #9286596 - Flags: sec-approval? → sec-approval+
Attachment #9286597 - Flags: sec-approval? → sec-approval+
Attachment #9286598 - Flags: sec-approval? → sec-approval+
Attachment #9286599 - Flags: sec-approval? → sec-approval+
Attachment #9346741 - Flags: sec-approval? → sec-approval+
Attachment #9346749 - Flags: sec-approval? → sec-approval+
Attachment #9346750 - Flags: sec-approval? → sec-approval+
Attachment #9346751 - Flags: sec-approval? → sec-approval+
Attachment #9350875 - Flags: sec-approval? → sec-approval+
Attachment #9350884 - Flags: sec-approval? → sec-approval+
Assignee: sphink → peterv
Pushed by pvanderbeken@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9c411bf84079 Improve nsIRandomGenerator APIs. r=mccr8,necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/9dcd2416f8a5 Make CloneBuffer use JSAPI directly. r=padenot https://hg.mozilla.org/integration/autoland/rev/2865fe682139 Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views. r=jonco https://hg.mozilla.org/integration/autoland/rev/7140866dd9f6 Add a copy assignment operator to RawNodePosition. r=farre https://hg.mozilla.org/integration/autoland/rev/692f2a759573 Add JS::AutoCheckCannotGC::reset. r=peterv https://hg.mozilla.org/integration/autoland/rev/40c6d6eed7f8 Implement JS::EnsureNonInlineArrayBufferOrView. r=jonco https://hg.mozilla.org/integration/autoland/rev/f70e09a7f5c6 New TypedArray APIs. r=farre https://hg.mozilla.org/integration/autoland/rev/daf7d747853a Use new TypedArray APIs for appending data to a container. r=farre,extension-reviewers,media-playback-reviewers,kmag,alwu,padenot https://hg.mozilla.org/integration/autoland/rev/5fa36d8fd2a5 Use new TypedArray APIs for copying data to a container. r=farre https://hg.mozilla.org/integration/autoland/rev/c1330b5e8c43 Use new TypedArray APIs for creating a container with a copy of the data. r=necko-reviewers,farre,kershaw https://hg.mozilla.org/integration/autoland/rev/6d0649fdafff Use new TypedArray APIs for processing data. r=farre,media-playback-reviewers,padenot,chunmin,sfink https://hg.mozilla.org/integration/autoland/rev/76c408bcd053 Remove old TypedArray APIs. r=farre https://hg.mozilla.org/integration/autoland/rev/5f2c25d194ad apply code formatting via Lando
Backout by csabou@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3c19998bbddf Backed out 13 changesets for causing fetch related crashes.
Pushed by pvanderbeken@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/423cb05856e8 Improve nsIRandomGenerator APIs. r=mccr8,necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/9497d4dcd737 Make CloneBuffer use JSAPI directly. r=padenot https://hg.mozilla.org/integration/autoland/rev/719bfaded837 Implement the ability to pin lengths (prevent changes to length, eg via detaching) for ArrayBuffers and their views. r=jonco https://hg.mozilla.org/integration/autoland/rev/9589445bbb52 Add a copy assignment operator to RawNodePosition. r=farre https://hg.mozilla.org/integration/autoland/rev/603e8067e68f Add JS::AutoCheckCannotGC::reset. r=peterv https://hg.mozilla.org/integration/autoland/rev/e39ae7aad8d9 Implement JS::EnsureNonInlineArrayBufferOrView. r=jonco https://hg.mozilla.org/integration/autoland/rev/88e9e7646e0d New TypedArray APIs. r=farre https://hg.mozilla.org/integration/autoland/rev/338049100f37 Use new TypedArray APIs for appending data to a container. r=farre,extension-reviewers,media-playback-reviewers,kmag,alwu,padenot https://hg.mozilla.org/integration/autoland/rev/8acc055a56d5 Use new TypedArray APIs for copying data to a container. r=farre https://hg.mozilla.org/integration/autoland/rev/fab63d21d38b Use new TypedArray APIs for creating a container with a copy of the data. r=necko-reviewers,farre,kershaw https://hg.mozilla.org/integration/autoland/rev/fe6fd98dac14 Use new TypedArray APIs for processing data. r=farre,media-playback-reviewers,padenot,chunmin,sfink https://hg.mozilla.org/integration/autoland/rev/b9bbd30d9797 Remove old TypedArray APIs. r=farre
Regressions: 1854929

What's the plan for ESR115 approval here?

Flags: needinfo?(peterv)
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-
Regressions: 1856672
Blocks: 1858533

(In reply to Ryan VanderMeulen [:RyanVM] from comment #100)

What's the plan for ESR115 approval here?

So far it stuck pretty well on trunk, but I think we should still wait a bit longer for regressions (in fact bug 1856672 just came in). Hopefully by 115.5esr we should have more confidence.

Flags: needinfo?(peterv)
Whiteboard: [adv-main119+r]

:peterv do you still plan on uplifting this to esr for 115.5? if so, can you nominate it for uplift?

Flags: needinfo?(peterv)

I think that given the crash volume and lack of progress in bug 1856672 I would rather postpone.

Flags: needinfo?(peterv)
Regressions: 1858678

Peter, given that bug 1856672 doesn't seem to have progressed, should we more forward with backporting this to ESR? Or give up on it entirely?

Flags: needinfo?(peterv)
Regressions: 1881858

Yeah, at this point it seems better to give up. If we do figure out a fix for bug 1856672 in time I'll come back to this and nominate, but it's looking unlikely.

Flags: needinfo?(peterv)

Bulk-unhiding security bugs fixed in Firefox 119-121 (Fall 2023). Use "moo-doctrine-subsidy" to filter

Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: