Closed Bug 1449849 Opened 2 years ago Closed 4 months ago

Add a Latin1-or-UTF-16 WebIDL string type

Categories

(Core :: DOM: Bindings (WebIDL), enhancement, P3)

enhancement

Tracking

()

RESOLVED WONTFIX

People

(Reporter: hsivonen, Unassigned)

Details

(Whiteboard: [MemShrink:P2])

See bug 1449659 comment 1.

Let's add a WebIDL string type (to be used where specs say DOMString or USVString) that can hold either a Latin1 string or a UTF-16 string.

Use cases:

 * Getting and setting the data of DOM text nodes, which internally store Latin1 or UTF-16, without conversion.

 * In general, avoiding unnecessary inflation of JS engine Latin1 strings to UTF-16 in performance-sensitive APIs where the C++ side is willing to add a bit of complexity in order to gain performance (e.g. TextEncoder).

 * In general, avoiding unnecessary inflation to UTF-16 in APIs where the C++ side returns a value it knows to be always ASCII.

Details:

 * When a string is passed from the JS engine to the DOM implementation, the DOM implementation should be able to access the following:
   - An indication of Latin1 vs. UTF-16
   - A read-only view of the string valid during the call from the JS engine to the DOM. Span<const char> for Latin1 or Span<const char16_t> for UTF-16.
   - An indication of whether the string's storage is nsStringBuffer and if it is, the pointer to the nsStringBuffer.

 * When a string is passed from the DOM to the JS engine, the DOM should be able to pass an nsStringBuffer, it's filled length and an indication of whether it has Latin1 or UTF-16 semantics.
Summary: Add a Latin1-or-UTF-16 string type → Add a Latin1-or-UTF-16 WebIDL string type
Blocks: 1449861
Priority: -- → P3
Whiteboard: [memshrink]
Whiteboard: [memshrink] → [MemShrink:P2]
> A read-only view of the string valid during the call from the JS engine to the DOM.

In practice, this means the bindings have to make a copy at some point.  Ideally we know whether we are OK with Latin1 on the output end before we make that copy, so we don't end up having to make two copies.

If we actually add a new IDL type here, that seems doable.  If we try to automagically have DOMString mean Latin1 is ok based on characteristics of the callee, that would be somewhat more difficult.
Some implementation thoughts:

1)  We may want to do this with an extended attribute, after implementing <https://bugzilla.mozilla.org/show_bug.cgi?id=1359269>.  But in the short term just adding a new type might be fine.

2)  We may want to restrict this to function arguments and not worry about dictionary members, sequence members, etc.  That would allow us to only change FakeString, effectively.
Blocks: 1489042
We probably ought to mention this in our MDN WebIDL guide.
Keywords: dev-doc-needed

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #1)

A read-only view of the string valid during the call from the JS engine to the DOM.

In practice, this means the bindings have to make a copy at some point.

Where can I read more about the source of this constraint? Is the problem that the garbage collector runs on a different thread and can free the JS string during the call into C++? Can we not use the same mechanism for keeping the string alive that we'd use to keep it alive while making a copy?

Is the problem that the garbage collector runs on a different thread and can free the JS string during the call into C++?

No, the problem is that the GC might run due to something that the C++ or something later in the binding implementation does and move the JS string in memory during the call into C++.

We could try to optimize out the copy in cases when we know GC is not possible after we start referencing the string data. In practice this would mean:

  1. We could only do this for the last argument to the method (or the argument to a setter), because processing of a later argument can nearly always run GC. There are some exceptions, but they're a little complicated.

  2. We could only do this for C++ methods that we can prove can't run GC.

Can we not use the same mechanism for keeping the string alive that we'd use to keep it alive while making a copy?

The mechanism used while making a copy is ensuring that GC does not run between when we grab the pointer into the memory and when we complete copying, by not making any calls that might trigger GC. There is both static analysis and (debug-only) runtime asserts to make sure that this last holds.

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #5)

Is the problem that the garbage collector runs on a different thread and can free the JS string during the call into C++?

No, the problem is that the GC might run due to something that the C++ or something later in the binding implementation does and move the JS string in memory during the call into C++.

I wonder how much we need to worry about this...

The string object itself obviously needs to handle the backing string moved by GC, but string API consumers already don't have strong guarantees about how long raw character pointers returned by .get()/.BeginReading() will stay alive. They can be freed any time something mutates a string.

If we're worried about code making bad assumptions about the lifetimes of those pointers, it might make sense to make the rooting hazard analysis aware that those methods might return pointers to GC data that can't be kept alive across GCs.

The string object itself obviously needs to handle the backing string moved by GC

Right. So we can't just wrap a nsDependentString around the chars+length coming from the JS engine and call it a day. We'd need a new nsAString implementation that's GC-aware.

They can be freed any time something mutates a string.

True, if they mutate the specific nsAString instance that you are working with. Bindings have always guaranteed that no one would mutate that string without obvious cheating: it's allocated in the bindings and passed as a const reference.

it might make sense to make the rooting hazard analysis aware that those methods might return pointers to GC data

Yes, that would help a lot. If we had that and a GC-aware nsAString, that would allow us to avoid copying.

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #5)

Is the problem that the garbage collector runs on a different thread and can free the JS string during the call into C++?

No, the problem is that the GC might run due to something that the C++ or something later in the binding implementation does and move the JS string in memory during the call into C++.

OK, so not a threading problem but a problem of potentially any call from C++ to JS API causing the string to move?

We could try to optimize out the copy in cases when we know GC is not possible after we start referencing the string data. In practice this would mean:

  1. We could only do this for the last argument to the method (or the argument to a setter), because processing of a later argument can nearly always run GC. There are some exceptions, but they're a little complicated.

My primary use cases for this in the JS-to-DOM direction are:

  1. TextEncoder.encodeInto()
  2. The setter and the methods that manipulate the 'data' member of a DOM text node
  3. TextEncoder.encode()
  4. Implementation internals of bug 1449861 (WebIDL string type that shows UTF-8 to the C++ code).

Of these, TextEncoder.encodeInto() also has a later argument after the string: The Uint8Array to write into. The C++ code has to call ComputeLengthAndData() on the Uint8Array. Is that enough to potentially cause the JS engine to move the string data around?

As noted in comment 0, I expect the JS string data to be exposed as Span<const char> or Span<const char16_t>. That is, I don't expect this feature to provide general nsCString semantics. However, the rules for the JS string data pointed to by Span<const char> or Span<const char16_t> not going away should be clear enough for people to understand so as not to program use-after-frees.

. . .

In the DOM-to-JS data flow direction, I care mainly about the getter for the 'data' member of DOM text nodes. Other than that, there might be some opportunities on things like getters for always-ASCII (Punycode or percent-encoded) URL components. (There are a bunch of other APIs that always return ASCII, but those probably aren't perf-sensitive.)

OK, so not a threading problem but a problem of potentially any call from C++ to JS API causing the string to move?

Yes, for calls that can gc. Some can't and we know that statically... But most can.

Of these, TextEncoder.encodeInto() also has a later argument after the string: The Uint8Array to write into.

That would actually be safe, I think. The argument conversion for Uint8Array is side-effect free. That said, this is not super-obvious at a glance...

The C++ code has to call ComputeLengthAndData() on the Uint8Array.

That's side-effect free and can't cause GC.

If you don't care about AString compat, then we have more options here. Most simply, we can do something where bindings just unwrap to rooted linear JSStrings on the stack and then we convert from linear string to Span<> at C++ argument-passing time, after all the other argument conversions have run. That would reduce problems to the case when the C++ implementation can actually GC, and I think we might be able to make the rooting analysis handle that for us. Steve, thoughts?

In the DOM-to-JS data flow direction, I care mainly about the getter for the 'data' member of DOM text nodes.

In the DOM-to-JS direction we generally end up doing a no-copy external string. That said, in the specific case you mention if the text node is holding on to one-byte text we do end up making a copy right now...

Flags: needinfo?(sphink)

Ooh boy.

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #5)

  1. We could only do this for the last argument to the method (or the argument to a setter), because processing of a later argument can nearly always run GC. There are some exceptions, but they're a little complicated.

Any argument could be a problem, not just later ones. In old terminology, there is no sequence point between arguments, so the code for the expressions in f(expr1, expr2) can run in any order, or even interleaved (and I have observed this in practice -- it was a rooting hazard that was causing crashes, so it's not purely theoretical.) The new terminology is that they are "unsequenced", which means the same thing.

But if that's a problem, I think you could just do the GC-ing argument computation into temporaries just before the call, and have full control over the sequencing.

  1. We could only do this for C++ methods that we can prove can't run GC.

Can we not use the same mechanism for keeping the string alive that we'd use to keep it alive while making a copy?

The mechanism used while making a copy is ensuring that GC does not run between when we grab the pointer into the memory and when we complete copying, by not making any calls that might trigger GC. There is both static analysis and (debug-only) runtime asserts to make sure that this last holds.

Keeping strings alive is relatively easy, though it's not always easy to prove to the compiler or static analysis that you're doing so. Any reference to the string will do; we usually have a Rooted<JSString*> sitting on the stack. Functions that operate on strings generally take Handle<JSString*> parameters to ensure that you've done that, and we require passing in some sort of AutoRequireNoGC& token to make the caller promise it's aware of the issue and is avoiding calling anything that can GC. (See eg JS_GetLatin1StringCharsAndLength.)

For cases where you want to access strings chars and GC, there's AutoStableStringChars, https://searchfox.org/mozilla-central/rev/c21d6620d384dfb13ede6054015da05a6353b899/js/public/StableStringChars.h#37-46 Note that I hope to be landing nursery strings soonish, but the warning about that in the comment sounds more dire than it really is. At least for now larger strings, even in the nursery, still point to malloced string data that does not move and thus no copying is required. AutoStableStringChars also roots the string to keep it alive.

(I notice that it's returning mozilla::Range<> to hand back the chars; I guess that should be changed to Span<> now.)

I guess I'm not following exactly what would happen with eg TextEncoder.encodeInto(), but it seems like it would be possible to do all GCing operations before retrieving pointers to the chars? And then we'd require an AutoRequireNoGC& (probably using the concrete class AutoCheckCannotGC on the stack) to keep you honest. Does that cover what you need?

Or do you need a Span<> that is GC-aware, so that you can't let it escape or the analysis will scream at you? The difference is just between having separate AutoCheckCannotGC and Span<> objects vs wrapping them up in one thing. I can make something for the latter if that's important; it would really be no different from a struct that contained the Span<> and an AutoCheckCannotGC and presented a Span-like API.

For short strings (up to 23 latin1 chars on 64-bit), it should all still work fine as long as you don't need to GC. If you need to GC, then AutoStableStringChars will not prevent copying.

Hopefully that answers the question?

Flags: needinfo?(sphink)

I'm mildly concerned that such a type would potentially inhibit moving JS strings from being Latin-1-or-UTF-16 to UTF-8-or-UTF-16, with a flag in the UTF-8 case to note "this is ASCII" and with something like automatic degrading to UTF-16 on non-sequential indexing, or indexing too far away from a cached index-spot. Such a setup might not be feasible/shippable now; but we should not actively work against that long-term possibility.

Would having such a thing in WebIDL possibly make it easier for people to depend on characteristics that this alternative model would support only poorly? I don't know. And I'd like us to be careful before doing something quickly that we might eventually regret long-term.

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

  1. We could only do this for the last argument to the method (or the argument to a setter), because processing of a later argument can nearly always run GC. There are some exceptions, but they're a little complicated.

Any argument could be a problem, not just later ones. In old terminology, there is no sequence point between arguments, so the code for the expressions in f(expr1, expr2) can run in any order, or even interleaved (and I have observed this in practice -- it was a rooting hazard that was causing crashes, so it's not purely theoretical.) The new terminology is that they are "unsequenced", which means the same thing.

I think what was meant was the processing that WebIDL-ish bindings do to coerce arguments of "any" type into arguments of the type of the actual DOM function. Not processing of arguments to exact C++ functions, subject to the rules and non-rules of C++.

But if that's a problem, I think you could just do the GC-ing argument computation into temporaries just before the call, and have full control over the sequencing.

...wait, we pass in temporaries in bindings code? That seems super-reckless versus having sensible locals.

Any argument could be a problem, not just later ones.

To be clear, the code generated for a Web IDL function with, say, 3 args, looks like this, in pseudo-code:

   doStuff(cx, argc, vp) {
     let arg1 = convertArg1();
     let arg2 = convertArg2();
     let arg3 = convertArg3();
     callC++Function(arg1, arg2, arg3);
   }

the convertArgN() calls can trigger GC in many cases, but are most definitely sequenced in a particular order (required by the IDL spec and easily observable in various ways in many cases). The call to the C++ function, which has unsequenced things going on, shouldn't do GC stuff to argN at that point.

you could just do the GC-ing argument computation into temporaries just before the call

Right, that is what we already do. ;)

For cases where you want to access strings chars and GC, there's AutoStableStringChars

Ah. That might just address Henri's use cases in the simplest way. It will still copy under the hood sometimes, but only if needed...

but it seems like it would be possible to do all GCing operations before retrieving pointers to the chars?

Yes.

probably using the concrete class AutoCheckCannotGC on the stack

And this will trigger the static analysis to make sure the C++ callee can't GC? We can certainly do that.

Or do you need a Span<> that is GC-aware, so that you can't let it escape or the analysis will scream at you?

I don't think we do: callees of binding methods shouldn't expect their args to outlive returning to the bindings unless they explicitly copy. I may end up creating a struct that bundles up a Span and AutoCheckCannotGC anyway, just to make the binding code and codegen simpler, but we don't need that as part of JSAPI if all the pieces are. That's assuming we don't just use AutoStableStringChars.

we pass in temporaries in bindings code?

We do some of all. ;) We have locals for storing the state and then temporaries for ensuring our callees are not insane.

For example, the current binding code for strings is more or less this:

   nsString foo;
   ConvertJSArgToString(foo);
   CallImplementationMethod(Constify(foo));

where

  template <typename T>
  const T& Constify(T& arg) {
    return arg;
  }

so callees don't declare that they take a non-const reference and start mutating things they shoud not mutate.

(In reply to Jeff Walden [:Waldo] from comment #11)

I'm mildly concerned that such a type would potentially inhibit moving JS strings from being Latin-1-or-UTF-16 to UTF-8-or-UTF-16, with a flag in the UTF-8 case to note "this is ASCII" and with something like automatic degrading to UTF-16 on non-sequential indexing, or indexing too far away from a cached index-spot. Such a setup might not be feasible/shippable now; but we should not actively work against that long-term possibility.

Having UTF-8 internally in JS strings in the future would be cool. I don't think this is going to prevent that. Notably:

  1. The use cases for this type for JS-to-DOM direction are very few, so the future refactoring burden should be manageable.
  2. For the DOM-to-JS direction, only the getter for text node "data" needs the Latin1 capability. For other plausible cases, we know we have ASCII rather than Latin1, so we could introduce an ASCII-only type for the DOM-to-JS direction and the same thing would work with both Latin1 and UTF-8-encoded SpiderMonkey-internal representations.

Would having such a thing in WebIDL possibly make it easier for people to depend on characteristics that this alternative model would support only poorly?

TextEncoder is all about encoding to UTF-8, so if SpiderMonkey got UTF-8 internal representation, we should change TextEncoder anyway. As for DOM text nodes, as long as SpiderMonkey and the DOM both do the same Latin1 optimization, it's pretty sad for them not to tell each other about it.

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

For cases where you want to access strings chars and GC,

I think it sufficient to have access to the string's storage and to know that GC won't make that storage to go away before the end of the C++ implementation of a WebIDL method. That is, if it TextEncoder::EncodeInto() (https://searchfox.org/mozilla-central/source/dom/encoding/TextEncoder.cpp#53), aSrc was a Latin1-or-UTF-16 type, I'd like to know that it's safe to read from after the call to aDst.ComputeLengthAndData(); and before the end of TextEncoder::EncodeInto().

there's AutoStableStringChars, https://searchfox.org/mozilla-central/rev/c21d6620d384dfb13ede6054015da05a6353b899/js/public/StableStringChars.h#37-46

Performing possible conversion to string using WebIDL rules and then letting the C++ implementation of the WebIDL method see the string as AutoStableStringChars seems sufficient to me.

Additional question:
Are SpiderMonkey strings not guaranteed to have contiguous storage? (I wouldn't mind having to iterate over multiple segments in TextEncoder::encodeInto() to avoid copies.)

then letting the C++ implementation of the WebIDL method see the string as AutoStableStringChars seems sufficient to me.

So just to be clear, AutoStableStringChars will copy for short strings (which are stored inline in the JSString and hence can move if it moves) and for nursery-allocated strings when we turn those on. Long non-nursery strings won't get copied; I assume that's our main concern here?

Are SpiderMonkey strings not guaranteed to have contiguous storage?

They are not. Concatenation in SpiderMonkey produces ropes. There's a comment at https://searchfox.org/mozilla-central/rev/bee8cf15c901b9f4b0c074c9977da4bbebc506e3/js/src/vm/StringType.h#114-144 that summarizes the string representation in SpiderMonkey.

If you were willing to deal with non-linear strings you'd need to deal with the rope tree structure.

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #16)

then letting the C++ implementation of the WebIDL method see the string as AutoStableStringChars seems sufficient to me.

So just to be clear, AutoStableStringChars will copy for short strings (which are stored inline in the JSString and hence can move if it moves) and for nursery-allocated strings when we turn those on. Long non-nursery strings won't get copied; I assume that's our main concern here?

Are SpiderMonkey strings not guaranteed to have contiguous storage?

They are not. Concatenation in SpiderMonkey produces ropes. There's a comment at https://searchfox.org/mozilla-central/rev/bee8cf15c901b9f4b0c074c9977da4bbebc506e3/js/src/vm/StringType.h#114-144 that summarizes the string representation in SpiderMonkey.

If you were willing to deal with non-linear strings you'd need to deal with the rope tree structure.

Maybe the conclusion here should be that the couple of cases where it would make sense to expose the UTF-16-or-Latin1 implementation detail to DOM code in the JS-to-DOM direction (text node data setter and inserter as well as TextEncoder) are such that it doesn't make sense to cause copying to create a flat string at the binding layer and instead it would make sense to expose something that allows iteration over the rope: if one tries to optimize away the Latin1-to-UTF-16 inflation, why not also optimize away a round of copies. (I take it that a rope can contain UTF-16 segments and Latin1 segments?)

Then text node data setters could iterate over the rope copying directly to the text node storage and TextEncoder could iterate over the rope and call encoding_rs::mem::convert_utf16_to_utf8_partial or encoding_rs::mem::convert_latin1_to_utf8_partial on each rope segment as appropriate. This should probably happen in a helper function that has TextEncoder::EncodeInto semantics except that the helper function should output to a Span rather than Uint8Array and return a mozilla::Tuple<size_t, size_t>, so that bug 1449861 could be implemented in terms of the same helper function.

Text node data getter should still get some mechanism to return a Latin1 string from DOM to JS.

No longer blocks: 1449861
No longer blocks: 1489042

Overtaken by bug 1561564.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.