Closed Bug 1294742 Opened 8 years ago Closed 8 years ago

stylo/libnsstring: Add helpers and abstractions for Servo to create and manipulate Gecko UTF-16 strings

Categories

(Core :: CSS Parsing and Computation, defect, P3)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1295762

People

(Reporter: bholley, Unassigned)

References

Details

So far, we've basically sidestepped most of the string representation and ownership issues by using Gecko atoms from Servo and atomizing everything we come across.

However, this isn't going to work for everything, including things like CSSOM selector and property serialization. Creating a UTF-8 string and then converting it to UTF-16 and mucking around with byte buffers is a pretty lame way to go about it, so I think we should add some kind of abstraction whereby the Servo style system and create and manipulate nsStrings directly.

There was some discussion about this on IRC last night after I went to bed:

http://logs.glob.uno/?c=mozilla%23layout&s=12+Aug+2016&e=12+Aug+2016#c20170
Nathan, what's the current status on the allocators? Can we malloc a buffer in Servo and then free it in Gecko, or is that still verboten?
Flags: needinfo?(nfroyd)
Manish, this seems like the kind of thing that might be up your alley. Do you have cycles to work on it once the already_AddRefed glue is done?
Flags: needinfo?(manishearth)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #1)
> Nathan, what's the current status on the allocators? Can we malloc a buffer
> in Servo and then free it in Gecko, or is that still verboten?

We can do that now that glandium hooked HeapAlloc et al on Windows.
Flags: needinfo?(nfroyd)
Yeah, I can work on this next.
Assignee: nobody → manishearth
Status: NEW → ASSIGNED
Flags: needinfo?(manishearth)
Is there some reason that you couldn't call __rust_deallocate or whatever the Rust-side free'ing is, just to make life easier if the hooking does break in some way?
(In reply to Nathan Froyd [:froydnj] from comment #5)
> Is there some reason that you couldn't call __rust_deallocate or whatever
> the Rust-side free'ing is, just to make life easier if the hooking does
> break in some way?

The issue here is that we _may_ want to pass an owned char buffer from rust to C++, which then might transfer ownership into an nsString or something, and it would be a pain to track the need for a different deallocator.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #6)
> (In reply to Nathan Froyd [:froydnj] from comment #5)
> > Is there some reason that you couldn't call __rust_deallocate or whatever
> > the Rust-side free'ing is, just to make life easier if the hooking does
> > break in some way?
> 
> The issue here is that we _may_ want to pass an owned char buffer from rust
> to C++, which then might transfer ownership into an nsString or something,
> and it would be a pain to track the need for a different deallocator.

Just add another flag to nsString, can't hurt! ;)

That seems pretty reasonable.
Xidorn and Simon - can you brain-dump your requirements of what kinds of string operations you need to perform on the Servo side?
Flags: needinfo?(xidorn+moz)
Flags: needinfo?(simon.sapin)
I have an implementation of some ns*String bindings between rust and C++ inside of my XPCOM bindings patch: bug 1293362 (see part 2). Right now my only way to expose data from rust to C++ is through a borrowed reference to a string, and not by actually consuming the backing memory of a Vec<u8> or similar, but it wouldn't be too hard to add with an extra flag on the nsString. I also don't have support for nsAutoString or other types like that, but they shouldn't be impossible to add (though rust's lack of move constructors might make it a tad tricky).

It might be nice to share the implementations of these wrappers if the XPCOM bindings ever land.
I created the following function in patch I'm currently working on:
> uint16_t*
> Gecko_BeginWritingString(nsString* aString, size_t aLen)
> {
>   static_assert(sizeof(decltype(*aString->BeginWriting())) == sizeof(uint16_t),
>                 "Otherwise the reinterpret_cast below is wrong");
>   MOZ_ASSERT(aLen < UINT32_MAX);
>   aString->SetLength(aLen);
>   return reinterpret_cast<uint16_t*>(aString->BeginWriting());
> }

I guess for CSSOM, this kind of function with one which converts nsAString to Rust str should be enough. It doesn't seem to me that we need other string manipulation for CSSOM.
Flags: needinfo?(xidorn+moz)
Unless we want to use XPCOM string inside Servo to avoid the cost of additional conversion.
I guess we want to have a helper type which implements Write and fills the incoming data into nsAString directly.
Unassigning since mystor already has some bindings which seem like they will land -- let's wait for that first.
Assignee: manishearth → nobody
Status: ASSIGNED → NEW
Michael - what's your timeline on the string bindings? Sounds like it's blocking xidorn.
Flags: needinfo?(michael)
As discussed over video today, for strings in CSSOM return values, they’re typically the serialization in CSS syntax of something. For such serialization Servo implements the ToCss trait for a lot of types. This trait has a method that is generic over the type of "destination", the thing written to. It can be anything that implements the std::fmt::Write trait.

Michael’s bindings already implement this trait by transcoding to UTF-16 on the fly without allocating intermediate UTF-16 buffers. My guess is that this should be relatively cheap.

Eliminating this transcoding would be significantly more work in my opinion: we’d need not only a compiler plugin for UTF-16 literals, but also a replacement for the std::fmt module (which does things like converting numbers to strings)


For string in CSSOM parameters. Some of them as parsed as CSS, and rust-cssparser currently only supports &str (UTF-8, so with no unpaired surrogate). At least for now I suggest simply using String::from_utf16_lossy. We’ll need something if this conversion has a prohibitive performance cost *or* if we need to preserve unpaired surrogates, but again that would be much more involved.

This leaves string parameters that are not parser input. For example the property name in CSSStyleDeclaration.getPropertyValue. We may be able to write specialized operations like "are A and B a ASCII-case-insensitive match, where A is arbitrary UTF-16, and B is statically known to be lower-case and in the ASCII range" which might be faster than full transcoding.
Flags: needinfo?(simon.sapin)
JavaScript strings that happen to be within the Latin-1 range (U+0000 ~ U+00FF) can be represented in memory by SpiderMonkey in Latin-1, with one byte per code point.

Jack reminded me of another possible optimization that we’ve discussed before: if these strings additionally carry an "ASCII" bit (that says whether the string is entirely withing the ASCII range U+0000 ~ U+007F, maybe computed lazily on demand) then we can use it to skip transcoding (and ever copying, lifetimes permitting) since ASCII is a subset of UTF-8.

However, I don’t believe this bit currently exists and adding it would probably have to be in SpiderMonkey. Nicolas, does that sound like something you’d want to have? (Or, who should I talk to about this?)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(michael)
The best person to ask is probably Jan, as he implemented the Latin1 backend for JSString.

I think refining the type from Latin1 to ASCII is doable, though I do not think the JS engine internals will take any advantage of it.  If this might help embedders to save multiple MB of memory, I do not see why we should not do that.
Flags: needinfo?(nicolas.b.pierron) → needinfo?(jdemooij)
We could consider moving from Latin1 to ASCII in SpiderMonkey. That would be a bit unfortunate, because it means we will use more memory. We'd need Firefox Telemetry or browser instrumentation to figure out how many of our Latin1 strings are ASCII-only (ideally we'd count the number of chars/bytes instead of the number of strings).

An ASCII bit is also an option; we can use 2 bits to represent {ASCII, non-ASCII, unknown}. Often we know statically when we create strings that they're ASCII (concatenating two ASCII strings, number to string, date formatting, well-known atoms, etc), it'd be cheap to set the ASCII bit in that case.

As a start, you could add the JSString bits and a |bool IsASCII(HandleString)| API that's smart enough to cache the result on the string. Then we can measure how often this cache hits/misses. (We just have to be careful with permanent atoms, as these are accessed on multiple threads. We can eagerly set the ASCII bit on them though.)
Flags: needinfo?(jdemooij)
Jan, the ASCII bit (or rather tri-state on two bits) as you described is exactly what I had in mind. I did not mean to suggest moving from Latin1 to ASCII.
(In reply to Simon Sapin (:SimonSapin) from comment #19)
> Jan, the ASCII bit (or rather tri-state on two bits) as you described is
> exactly what I had in mind. I did not mean to suggest moving from Latin1 to
> ASCII.

I understood :) I was just thinking out loud about our options.
Let me ask a silly question: how are we supposed to use the binding implemented in bug 1295762 for Stylo? Are we going to make that as a dependency of Geckolib somehow?
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #21)
> Let me ask a silly question: how are we supposed to use the binding
> implemented in bug 1295762 for Stylo? Are we going to make that as a
> dependency of Geckolib somehow?

I'm not completely sure how you will be able to take advantage of it. When stylo is moved in tree, it should be possible for it to use libnsstring just like it would use other in-tree crates, but for now libnsstring may have to be moved into a separate repository as well, and used separately during stylo development.
You could publish versioned releases of crates.io while keeping the source repository in mozilla-central. Crates.io requires GitHub for authentication, but published packages are just tarballs and the "repository" metadata can be an arbitrary URL.
I wasn't sure if we would want this crate on crates, as it is completely useless for anyone other than people who are linking directly to libxul. I suppose nsstring isn't a highly competed for name though.
I guess it is probably better be a crate under github.com/servo organization, or alternatively land it in servo/ports/geckolib directly like gecko_bindings.

Probably the later, I guess, would be the easiest way currently.

We may eventually need it elsewhere outside stylo (and servo binding). We can probably move it to an independent crate when we get there.
Priority: -- → P3
Michael, are you still working on this UTF-16 nsstring library?
Flags: needinfo?(michael)
Summary: stylo: Add helpers and abstractions for Servo to create and manipulate Gecko UTF16 strings in a safe and performant manner → stylo/libnsstring: Add helpers and abstractions for Servo to create and manipulate Gecko UTF-16 strings
Manish has vendored Michael's nsstring crate into servo. I think this should be considered fixed.
I have implemented the nsstring bindings for rust here: http://searchfox.org/mozilla-central/source/xpcom/rust/nsstring

It's in central. It was implemented in bug 1295762.

I'm not sure if that covers everything you were looking for, if it does not, let me know and I can add in more methods.
Flags: needinfo?(michael)
Thanks! I'll dupe this bug to your XPCOM string bindings bug 1295762.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
(In reply to Michael Layzell [:mystor] from comment #28)
> I'm not sure if that covers everything you were looking for, if it does not,
> let me know and I can add in more methods.

Oh, one more thing I think would be helpful is a function to borrow an nsACString as &str. That would allow us to use only one parameter for a string from Gecko to Servo.

We currently have two parameters for each string parameter, one for the pointer (*const u8), the other for the length (u32). I think passing a single "*const nsACString" would be much more convenient.

Could you probably add that?
Flags: needinfo?(michael)
(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #30)
> (In reply to Michael Layzell [:mystor] from comment #28)
> > I'm not sure if that covers everything you were looking for, if it does not,
> > let me know and I can add in more methods.
> 
> Oh, one more thing I think would be helpful is a function to borrow an
> nsACString as &str. That would allow us to use only one parameter for a
> string from Gecko to Servo.
> 
> We currently have two parameters for each string parameter, one for the
> pointer (*const u8), the other for the length (u32). I think passing a
> single "*const nsACString" would be much more convenient.
> 
> Could you probably add that?

nsACString already derefs to &[u8], which is the best I can do (as I can't guarantee that the string is valid utf-8, which is required for &str). However, you can do the following, and get a `Result<&str, Utf8Error>`:

    std::str::from_utf8(&acstring)

If you _know_ that the nsACString is a valid utf-8 string, then you can do:

    unsafe {
        std::str::from_utf8_unchecked(&acstring)
    }

If those methods still feel too verbose, I can consider adding new methods to nsACString called utf8 and utf8_unchecked which do the above, but it felt unnecessary to me when you can so easily get a &[u8].
Flags: needinfo?(michael)
You need to log in before you can comment on or make changes to this bug.