1402247 - Use encoding_rs for XPCOM string encoding conversions

Assignee

Description

•

7 years ago

Currently, encoding conversions when appending nsAString to nsACString or vice versa use completely different conversion implementations compared to encoding conversions near the I/O boundary. The latter use encoding_rs whose ASCII path is carefully optimized not only on x86 but also on ARM (currently ALU code written specifically with preference given to armv7 and SIMD code for aarch64). Meanwhile, the conversions between nsAString to nsACString use mostly very old ALU code with some recent SIMD additions that a) work only on x86/x86_64 and b) give up more easily than SIMD in encoding_rs.

To avoid code duplication, specifically to avoid the duplication of the innermost ASCII to ASCII, ASCII to Basic Latin and Basic Latin to ASCII loops that should be per-ISA-optimized, I think it makes sense to change XPCOM strings to reuse encoding_rs's inner loops.

To this end, I'm pursuing the following:

 1) Exposing the necessary operations as functions collected in a module under encoding_rs (not a separate crate so that the way this new part and the old parts of encoding_rs share code is not publicly exposed). I'm developing this on the "memory" branch of encoding_rs: https://github.com/hsivonen/encoding_rs/blob/memory/src/mem.rs

 2) Writing the different methods on nsAString and nsACString in Rust in the nsstring crate by calling the functions in encoding_rs::mem. I intend to start this after validating encoding_rs::mem by benchmarking and fuzzing.

 3) Making the corresponding C++ methods into inlines that call into FFI exposed by the nsstring crate.

Risks:

 * The FFI overhead (from inhibiting optimizations across the FFI boundary and potentially from causing extra stack frames) might outweigh the benefit from the speed of encoding_rs internals for short strings (shorter than 16).

 * Trying to address the above risk or bug 1396490 may lead to complexity we previously didn't have. (Note: When it would make sense even for non-XPCOM callers, encoding_rs::mem already mitigates bug 1396490 in some cases.)

 * Mission creep for encoding_rs, since encoding_rs::mem would in terms of what functionality the API provides would make more sense as a distinct crate but is a module just in order to reuse the encoding_rs internals in a way that doesn't make sense to expose cross-crate.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

7 years ago

Blocks: 1349528

Henri Sivonen (:hsivonen)

Assignee

Updated

•

7 years ago

Blocks: 1355101

Henri Sivonen (:hsivonen)

Assignee

Comment 1

•

7 years ago

Bug 497204 as about reorganizing the API as opposed to changing its implementation.

Eric Rahm [:erahm]

Comment 2

•

7 years ago

Sounds like a good plan. I'm guessing we'd be able to get rid of nsReadableUtilsSSE2 [1] and nsUTF8UtilsSSE2 [2] and the various (Lossy)ConvertTo functions would just be pass-throughs?

[1] http://searchfox.org/mozilla-central/source/xpcom/string/nsReadableUtilsSSE2.cpp
[2] http://searchfox.org/mozilla-central/source/xpcom/string/nsUTF8UtilsSSE2.cpp

Henri Sivonen (:hsivonen)

Assignee

Comment 3

•

6 years ago

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #2)
> Sounds like a good plan. I'm guessing we'd be able to get rid of
> nsReadableUtilsSSE2 [1] and nsUTF8UtilsSSE2 [2] and the various
> (Lossy)ConvertTo functions would just be pass-throughs?
> 
> [1]
> http://searchfox.org/mozilla-central/source/xpcom/string/nsReadableUtilsSSE2.
> cpp
> [2]
> http://searchfox.org/mozilla-central/source/xpcom/string/nsUTF8UtilsSSE2.cpp

Yes.

Nika, in Austin we talked briefly about code organization and you advised me to put the new code in a file that's separate from servo/support/gecko/nsstring/src/lib.rs. I'm confused/clueless about range of options for doing that properly. As an example, when nsAString::append_utf8() becomes the implementation and the C++ side becomes the binding, how should the body of nsAString::append_utf8() (the code that'll perform allocations and call encoding_rs::mem::convert_utf8_to_utf16()) be put in a separate file?

Everyone, I have some questions about naming things.

In the case of nsAString, it's clear that self has UTF-16 semantics, so it's reasonable to have a method like append_utf8(), which is understood to convert from UTF-8 to UTF-16. However, in the case of nsACString, the UTF-8 vs. Latin1 semantics of self are not part of the type and have to be part of the operation. Would nsACString::append_latin1_to_utf8() and nsACString::append_utf8_to_latin1() be OK? (We don't have callers for these, yet, but seeking to enable us to use UTF-8 more, so having these ready for use is part of that enabling effort.)

What about operations that make sense only when self has UTF-8 semantics? If nsAString has is_latin1() and is_bidi(), should nsACString have is_latin() and is_bidi() (UTF-8 semantics implied, because with Latin1 semantics the answer is trivially `true` and `false`, respectively) or should nsACString have is_utf8_latin1() and is_utf8_bidi()?

Flags: needinfo?(nika)

Nathan Froyd [:froydnj]

Comment 4

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #3)
> In the case of nsAString, it's clear that self has UTF-16 semantics, so it's
> reasonable to have a method like append_utf8(), which is understood to
> convert from UTF-8 to UTF-16. However, in the case of nsACString, the UTF-8
> vs. Latin1 semantics of self are not part of the type and have to be part of
> the operation. Would nsACString::append_latin1_to_utf8() and
> nsACString::append_utf8_to_latin1() be OK? (We don't have callers for these,
> yet, but seeking to enable us to use UTF-8 more, so having these ready for
> use is part of that enabling effort.)

I think we want to put the names in the operation.  We have been talking about a more perfect world where nsCString is actually split into classes that enforce encoding, and having these operations named explicitly would at least point people in that direction.  Having explicit names would make any future string conversion efforts easier, since the semantics would be obvious at callsites.  Having them might also clue reviewers in to thinking about what encoding strings appear in, and squashing bugs before they appear.

> What about operations that make sense only when self has UTF-8 semantics? If
> nsAString has is_latin1() and is_bidi(), should nsACString have is_latin()
> and is_bidi() (UTF-8 semantics implied, because with Latin1 semantics the
> answer is trivially `true` and `false`, respectively) or should nsACString
> have is_utf8_latin1() and is_utf8_bidi()?

For avoidance of doubt, `is_utf8_latin1()` is testing if the string is entirely composed of codepoints that are:

1. ASCII; or
2. codepoints 128-255, encoded as UTF-8?

I think we want the UTF-8 semantics explicit, even if those names are a mouthful.  I don't know how to make the names any clearer without making them even larger (`is_latin1_in_utf8` or `is_latin1_with_utf8` sound slightly clearer to me, but I don't want anybody writing those out).  I have no expertise on the bidi front.

Nathan Froyd [:froydnj]

Updated

•

6 years ago

OS: Unspecified → All

Priority: -- → P2

Hardware: Unspecified → All

Henri Sivonen (:hsivonen)

Assignee

Comment 5

•

6 years ago

(In reply to Nathan Froyd [:froydnj] from comment #4)
> For avoidance of doubt, `is_utf8_latin1()` is testing if the string is
> entirely composed of codepoints that are:
> 
> 1. ASCII; or
> 2. codepoints 128-255, encoded as UTF-8?

Yes.

Henri Sivonen (:hsivonen)

Assignee

Comment 6

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #3)
> What about operations that make sense only when self has UTF-8 semantics? If
> nsAString has is_latin1() and is_bidi(), should nsACString have is_latin()
> and is_bidi() (UTF-8 semantics implied, because with Latin1 semantics the
> answer is trivially `true` and `false`, respectively) or should nsACString
> have is_utf8_latin1() and is_utf8_bidi()?

Hmm. On the C++ side, IsASCII() is a function rather than a method, so maybe on the Rust side, the story for the is_foo() methods that don't modify the string could be that one should call the functions in encoding_rs::mem. There's e.g. pub fn encoding_rs::mem::is_utf8_latin1(buffer: &[u8]) -> bool;

If these were methods on nsA[C]String, they'd be trivial wrappers around the encoding_rs::mem::is_foo() functions anyway.

Nika Layzell [:nika] (ni? for response)

Comment 7

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #3)
> Nika, in Austin we talked briefly about code organization and you advised me
> to put the new code in a file that's separate from
> servo/support/gecko/nsstring/src/lib.rs. I'm confused/clueless about range
> of options for doing that properly. As an example, when
> nsAString::append_utf8() becomes the implementation and the C++ side becomes
> the binding, how should the body of nsAString::append_utf8() (the code
> that'll perform allocations and call
> encoding_rs::mem::convert_utf8_to_utf16()) be put in a separate file?

In a crate, you can have an impl block for a type in a submodule for that type, for example:

    struct Stringy;
    
    // This would be in a separate file. It's identical semantically to putting it 
    // directly inline.
    mod conversions {
        use super::Stringy;
        impl Stringy {
            pub fn append_latin1_to_utf8(&mut self, latin1: &Stringy) { }
        }
    }
    
    #[test]
    fn test() {
        Stringy.append_latin1_to_utf8(&Stringy)
    }

https://play.rust-lang.org/?gist=bc67e4228ec409594e69cb6a4576c2b8&version=stable

Basically, submodules can access all private information and implement private details of the crate they're contained within. They are mostly used for organizational purposes.

I'd also be fine with the conversion methods being implemented as freestanding functions, and the methods defined on the actual types just being wrappers, but that might be less nice.

I basically just want to reduce the amount of code which is in the root of the crate, as right now the nsstring crate is a bit of a long unweildly single file.

> Everyone, I have some questions about naming things.
> 
> In the case of nsAString, it's clear that self has UTF-16 semantics, so it's
> reasonable to have a method like append_utf8(), which is understood to
> convert from UTF-8 to UTF-16. However, in the case of nsACString, the UTF-8
> vs. Latin1 semantics of self are not part of the type and have to be part of
> the operation. Would nsACString::append_latin1_to_utf8() and
> nsACString::append_utf8_to_latin1() be OK? (We don't have callers for these,
> yet, but seeking to enable us to use UTF-8 more, so having these ready for
> use is part of that enabling effort.)

I like those names. I think being very precise when it comes to the encodings used in every conversion method is a good idea, and I don't mind them being a little verbose.

> What about operations that make sense only when self has UTF-8 semantics? If
> nsAString has is_latin1() and is_bidi(), should nsACString have is_latin()
> and is_bidi() (UTF-8 semantics implied, because with Latin1 semantics the
> answer is trivially `true` and `false`, respectively) or should nsACString
> have is_utf8_latin1() and is_utf8_bidi()?

I like `is_utf8_latin1()` and `is_utf8_bidi()` personally. `is_latin1()` sounds to me like it's checking some sort of encoding flag on the string data. With the method `is_latin1()` I could imagine someone calling `s.is_latin1()` and then passing `s` to a function which expects a latin1 encoded string, which would be wrong.

Flags: needinfo?(nika)

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

6 years ago

(In reply to Nika Layzell [:mystor] from comment #7)
> In a crate, you can have an impl block for a type in a submodule for that
> type, for example:
> 
>     struct Stringy;
>     
>     // This would be in a separate file. It's identical semantically to
> putting it 
>     // directly inline.
>     mod conversions {
>         use super::Stringy;
>         impl Stringy {
>             pub fn append_latin1_to_utf8(&mut self, latin1: &Stringy) { }
>         }
>     }

I was unaware of this. Thank you!

> I like those names. I think being very precise when it comes to the
> encodings used in every conversion method is a good idea, and I don't mind
> them being a little verbose.

OK.

(In reply to Henri Sivonen (:hsivonen) from comment #6)
> Hmm. On the C++ side, IsASCII() is a function rather than a method, so maybe
> on the Rust side, the story for the is_foo() methods that don't modify the
> string could be that one should call the functions in encoding_rs::mem.
> There's e.g. pub fn encoding_rs::mem::is_utf8_latin1(buffer: &[u8]) -> bool;
> 
> If these were methods on nsA[C]String, they'd be trivial wrappers around the
> encoding_rs::mem::is_foo() functions anyway.

For the various encoding_rs::mem::is_foo() functions, instead of making them methods on nsA[C]String, I think I'm going to wrap them as functions in nsBidiUtils and nsReadableUtils starting with bug 1431025.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1431356

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/213974/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/213974/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/214564/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/214564/

Henri Sivonen (:hsivonen)

Assignee

Comment 11

•

6 years ago

As seen in the patch, I'm making one design bet that differs quite a bit from the old code:

The old code inspects (as in, actually examines the value instead of just memcpying it) each incoming code unit twice: Once to calculate the length of the conversion result and then again to do the actual conversion.

My bet is that it'll be better (faster) to examine the value of each incoming code unit only once (when converting) even if this means that we *sometimes* need to do another malloc+memcpy. My bet is that this is also going to be more efficient that the complexity of making the conversions able request more output space in mid-conversion.

There are three types of conversions:
 1) Constant: The number of code units in conversion input is the number of code units in the output. (Latin1 to UTF-16 and vice versa.) There's nothing controversial about this. The allocation is always right.

 2) Expanding: The number of code units in the conversion output may be greater than the number of code units in the input. (When converting to UTF-8.) First, the code asks for a buffer capacity that can fit the input if it's ASCII. The allocation logic rounds the bucket size up. If a non-ASCII code unit is found, at that point, the worst case for the remaining data is calculated. If the worst case fits within the bucket rounding, there's no need to reallocate. If not, another reallocation happens. At the end, the allocation bucket may or may not be optimal for the final length. At present, we don't have the capability to shrink the buffer in this case.

 3) Shrinking: The number of code units in the conversion output may be less than the number of code units in the input. (When converting from UTF-8.) First, allocate for the ASCII case, which is the space-wise worst case. At the end, the allocation bucket may or may not be optimal for the final length. At present, we don't have the capability to shrink the buffer in this case.

Obviously, memcpy is better than actually paying attention to the values in the memory a second time. The time saved may or may not be enough to absorb the cost of another malloc, though, especially for smaller strings (where the time saving from not inspecting the data a second time is smaller).

My justification for this is: Especially at present (when DOM text nodes don't go via UTF-8), we'll be most often converting ASCII, and the allocation is right on the first attempt for ASCII. When we're dealing with data whose actual code unit number different factor is near 2, there's a good chance we end up in the best power-of-two allocation bucket anyway.

Once we start converting more text-node-like content from UTF-8 to UTF-16, the shrinkage is going to be near a factor of 3 for CJK, in which case we might end up overallocating and will actually need to shrink the buffer at the end in order not to waste memory. But for the memory waste to be significant, the strings have to be long, in which case not examining them twice buys more time, so I estimate that having to sometimes realloc will be OK.

This won't address the problem of nsAutoString's built-in buffer getting replaced with a malloced buffer more often than now, but conversions where the target is nsAutoString are by definition for strings that we expect to be shorter than outright text node content, so chances are most of those conversions are ASCII most of the time and, as noted, allocations in the ASCII case are always tight.

Henri Sivonen (:hsivonen)

Assignee

Comment 12

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #11)
> Once we start converting more text-node-like content from UTF-8 to UTF-16,
> the shrinkage is going to be near a factor of 3 for CJK, in which case we
> might end up overallocating and will actually need to shrink the buffer at
> the end in order not to waste memory. But for the memory waste to be
> significant, the strings have to be long, in which case not examining them
> twice buys more time, so I estimate that having to sometimes realloc will be
> OK.

One more thing: If we start doing the above and it becomes a problem, we'd better be converting from &str instead of potentially-invalid UTF-8 nsACString anyway and in that case we could re-introduce a counting pre-flight for strings over some threshold length and could do some more efficiently than with potentially-invalid nsACString.

Henri Sivonen (:hsivonen)

Assignee

Comment 13

•

6 years ago

Once we have the ability to do assign/append in Rust in the cases that convert, shouldn't we handle append/assign of Rust slices in Rust instead of wrapping the buffer in a dependent string and calling to C++?

In such a case, how should the nsA[C]StringLike setup change to forward actual nsA[C]String to C++ while handling Rust slices without the overhead of dependent string?

Is the best way something as simple as adding a function that returns an enum wrapping &nsACString when implementing for nsACString and returns an enum wrapping &[u8] for everything else?

Flags: needinfo?(nika)

Nika Layzell [:nika] (ni? for response)

Comment 14

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #13)
> Once we have the ability to do assign/append in Rust in the cases that
> convert, shouldn't we handle append/assign of Rust slices in Rust instead of
> wrapping the buffer in a dependent string and calling to C++?
> 
> In such a case, how should the nsA[C]StringLike setup change to forward
> actual nsA[C]String to C++ while handling Rust slices without the overhead
> of dependent string?
> 
> Is the best way something as simple as adding a function that returns an
> enum wrapping &nsACString when implementing for nsACString and returns an
> enum wrapping &[u8] for everything else?

Creating the dependent nsA[C]String shouldn't be too expensive & I hope it could be optimized away by the optimizer.

If we need to make it more efficient though, we could definitely add an alternate implementation which produces a &[u8] enum. I'd be fine with either approach, although I'd like some evidence that the nsA[C]String conversion can't be optimized away by LLVM.

Flags: needinfo?(nika)

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/214564/diff/1-2/

Henri Sivonen (:hsivonen)

Assignee

Comment 17

•

6 years ago

Does this look like it's going in an OK direction?

I've followed the existing pattern of having foo and fallible_foo for every thing as separate implementations that where the infallible version causes a MOZ_CRASH as close to the point where we actually failed as possible.

This leads to quite a bit of duplication and makes me wonder if foo() should be implemented as fallible_foo.expect("OOM"); to avoid duplication while masking the precise OOM point.

Flags: needinfo?(nika)

Nika Layzell [:nika] (ni? for response)

Comment 18

•

6 years ago

mozreview-review

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

https://reviewboard.mozilla.org/r/214564/#review220970

::: servo/support/gecko/nsstring/src/conversions.rs:19
(Diff revision 2)
> +use conversions::encoding_rs::mem::*;
> +use conversions::encoding_rs::Encoding;
> +
> +#[inline(always)]
> +fn fallible_add(a: usize, b: usize) -> Result<usize, ()> {
> +    a.checked_add(b).ok_or(())

It feels like it might be cleaner to just use the options in the call sites for these functions, and then call .ok_or() at the call sites?

::: servo/support/gecko/nsstring/src/conversions.rs:37
(Diff revision 2)
> +
> +#[inline(always)]
> +fn fallible_times_two(a: usize) -> Result<usize, ()> {
> +    a.checked_mul(2).ok_or(())
> +}
> +

Consider moving the maybe_..._capacity stuff into rust code. It might look something like:

```
unsafe fn mut_buf_with_min_cap(str: &mut nsACString, min_cap: usize) -> Result<&mut [u8], ()> {
    str.fallible_set_capacity(min_cap)?;
    // Just poke at the struct from rust now, 'cause we can.
    let cap = str.capacity();
    let this = &mut nsCStringRepr = mem::transmute(str);
    Ok(slice::from_raw_parts_mut(this.data as *mut u8, cap as usize))
}
```

Then you can just call `set_length` when you're done writing into the buffer?

::: servo/support/gecko/nsstring/src/conversions.rs:73
(Diff revision 2)
> +        }
> +     )
> +}
> +
> +macro_rules! expanding_conversion {
> +    ($name:ident,

Please document these macros well. I want explanations as to what each of these macro arguments are intended to represent :-)

::: servo/support/gecko/nsstring/src/conversions.rs:79
(Diff revision 2)
> +     $fallible_name:ident,
> +     $convert:ident,
> +     $copy:ident,
> +     $math:ident,
> +     $other_ty:ty) => (
> +        fn $name(&mut self, other: $other_ty, old_len: usize) {

It's probably fine to just call the fallible version and .expect() on that. I'm not a big fan of the complexity being added here, and that's what most of the code on the C++ side does anyway.

I mostly had the separate implementations before on the rust side because they weren't too repetitive, and I figured it might be slightly more efficient.

::: servo/support/gecko/nsstring/src/conversions.rs:213
(Diff revision 2)
> +     $fallible_impl:ident,
> +     $string_like:ident) => (
> +        fn $name<T: $string_like + ?Sized>(&mut self, other: &T, old_len: usize) {
> +            let adapter = other.adapt();
> +            let other_slice = adapter.as_ref();
> +            let num_ascii = if adapter.is_abstract() && old_len == 0 {

The motivation behind this `is_abstract` check is that we can potentially re-use the buffer from the original string? Is that correct?

::: servo/support/gecko/nsstring/src/conversions.rs:216
(Diff revision 2)
> +                    // Calling something whose argument can be obtained from
> +                    // the adapter rather than an nsStringLike avoids a huge
> +                    // lifetime mess by keeping nsStringLike and
> +                    // Latin1StringLike free of lifetime interdependencies.
> +                    unsafe {
> +                        Gecko_AssignCString(self, adapter.as_ptr());
> +                    }
> +                    return;

This is mildly gross, not the biggest fan of doing direct calls into C++ here.

I presume that wrapping the adapted pointer into a new nsStringLike doesn't work super well here?

::: servo/support/gecko/nsstring/src/conversions.rs:360
(Diff revision 2)
> +
> +impl nsACString {
> +
> +    // UTF-16 to UTF-8
> +
> +    expanding_conversion!(append_utf16_to_utf8_impl, fallible_append_utf16_to_utf8_impl, convert_utf16_to_utf8, copy_basic_latin_to_ascii, times_three_plus_one, &[u16]);

These macros are confusing, especially this one. Please document why these arguments are being used with some comments.

Also this is definitely over 80 columns wide, could we split it over a few lines?

::: servo/support/gecko/nsstring/src/conversions.rs:557
(Diff revision 2)
> +        self.fallible_append_utf8_to_latin1_lossy_check(other, len)
> +    }
> +
> +    // Latin1 to UTF-8 CString
> +
> +    ascii_copy_avoidance!(append_latin1_to_utf8_check, fallible_append_latin1_to_utf8_check, append_latin1_to_utf8_impl, fallible_append_latin1_to_utf8_impl, Latin1StringLike);

I don't think I understand why you can't use nsCStringLike here. Sure it would allow passing &str when you could append that more efficiently elsewhere, but that doesn't seem like a big deal to me.

::: servo/support/gecko/nsstring/src/lib.rs:395
(Diff revision 2)
> +            ///
> +            /// # Panics
> +            ///
> +            /// If the new length does not fit in `u32`.
> +            unsafe fn maybe_expand_capacity(&mut self, len: usize) -> &mut [$char_t] {
> +                if len == 0 {

The duplication here is unfortunate. I think I'd be comfortable just .expect()-ing on the result of the fallible_maybe_expand_capacity function.

::: servo/support/gecko/nsstring/src/lib.rs:963
(Diff revision 2)
> +/// This trait is implemented on types which are Latin1 `nsCString`-like,
> +/// in that they can at very low cost be converted to a borrowed
> +/// `&nsACString` and do not denote UTF-8ness in the Rust type system.
> +///
> +/// This trait is used to DWIM when calling the methods on
> +/// `nsACString`.

I presume that the intention here is that this type is only implemented on types which don't have a utf-8 constraint?

IIRC all of these types already implement nsCStringLike, except `&str, String, and Box<str>` happen to also implement nsCStringLike, making that the only difference.

::: servo/support/gecko/nsstring/src/lib.rs:979
(Diff revision 2)
> +impl<'a, T> Latin1StringLike for borrow::Cow<'a, T>
> +    where T: Latin1StringLike + borrow::ToOwned + ?Sized {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        <T as Latin1StringLike>::adapt(self.as_ref())
> +    }
> +}
> +
> +impl Latin1StringLike for nsACString {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        nsCStringAdapter::Abstract(self)
> +    }
> +}
> +
> +impl<'a> Latin1StringLike for nsCStr<'a> {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        nsCStringAdapter::Abstract(self)
> +    }
> +}
> +
> +impl Latin1StringLike for nsCString {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        nsCStringAdapter::Abstract(self)
> +    }
> +}
> +
> +impl Latin1StringLike for [u8] {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        nsCStringAdapter::Borrowed(nsCStr::from(self))
> +    }
> +}
> +
> +impl Latin1StringLike for Vec<u8> {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        nsCStringAdapter::Borrowed(nsCStr::from(&self[..]))
> +    }
> +}
> +
> +impl Latin1StringLike for Box<[u8]> {
> +    fn adapt(&self) -> nsCStringAdapter {
> +        nsCStringAdapter::Borrowed(nsCStr::from(&self[..]))
> +    }

As nsCStringLike should be implemented on all of these types, how about defining a macro to bulk implement this?

// something like:
macro_rules! latin1like {
    ($T:ty) => {
        impl<'a> Latin1StringLike for $T {
            fn adapt(&self) -> nsCStringAdapter {
                nsCStringLike::adapt(self)
            }
        }
    }
}

::: servo/support/gecko/nsstring/src/lib.rs:1105
(Diff revision 2)
>      fn Gecko_AssignCString(this: *mut nsACString, other: *const nsACString);
>      fn Gecko_TakeFromCString(this: *mut nsACString, other: *mut nsACString);
>      fn Gecko_AppendCString(this: *mut nsACString, other: *const nsACString);
>      fn Gecko_SetLengthCString(this: *mut nsACString, length: u32);
>      fn Gecko_BeginWritingCString(this: *mut nsACString) -> *mut u8;
> +    fn Gecko_MaybeExpandCapacityCString(this: *mut nsACString, length: *mut u32) -> *mut u8;

I think it would be cleaner to just expose SetCapacity to the rust code as-is, and then perform the other changes which you want to perform in the rust code. Adding this new method to the C++ side and then calling it from rust feels weird to me.

::: xpcom/string/nsTSubstring.h:597
(Diff revision 2)
> +  /**
> +   * If *aCapacity is larger than the current capacity, allocates a
> +   * buffer whose length is at least *aCapacity.
> +   *
> +   * Sets *aCapacity and the string's length to the actual capacity.
> +   *
> +   * Returns a pointer to the start of the buffer.
> +   *
> +   * MOZ_CRASHes on failure.
> +   *
> +   * Note that unlike GetMutableData, this rounds the length up to the
> +   * capacity.
> +   */
> +  inline char_type* MaybeExpandCapacity(size_type* aCapacity)
> +  {
> +    // SetCapacity unshares a shared buffer even then resizing is not
> +    // needed.
> +    SetCapacity(*aCapacity);
> +    size_type capacity = Capacity();
> +    // SetCapacity doesn't stretch the logical length for us.
> +    this->mLength = capacity;
> +    *aCapacity = capacity;
> +    char_type* ptr = base_string_type::mData;
> +    // SetCapacity zero-terminated at intermediate length, not capacity.
> +    ptr[capacity] = 0;
> +    return ptr;
> +  }

I don't think I fully understand this function, and I don't particularially like the name of it. As far as I can tell what this function is trying to do is reserve at least aCapacity bytes, and return a mutable buffer and the actual size of that buffer. The string isn't necessarially in a valid state, and SetLength must be called to set the length correctly before the actual data is written.

I think I'd prefer exposing SetCapacity() and Capacity() to rust code (as those are useful functions), and then implementing this as a helper function inside of the encoding code, as it probably won't be used elsewhere. I could imagine that not being able to optimize between the SetCapacity and Capacity() calls may not be fast enough though.

If we need to keep it in C++, perhaps call it MutateWithMinCapacity() or something like that and document that the length must be manually updated once the buffer has been filled in.

Nika Layzell [:nika] (ni? for response)

Comment 19

•

6 years ago

Overall stuff seems mostly OK - I like the new names for the public functions etc, but I have some questions about internal structuring of the logic, would like more internal documentation, and don't like the MaybeExpandCapacity function.

Flags: needinfo?(nika)

Henri Sivonen (:hsivonen)

Assignee

Comment 20

•

6 years ago

mozreview-review-reply

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

https://reviewboard.mozilla.org/r/214564/#review220970

> It feels like it might be cleaner to just use the options in the call sites for these functions, and then call .ok_or() at the call sites?

OK. (Looking forward ? converting None to Err(()) sometime in the future.)

> Consider moving the maybe_..._capacity stuff into rust code. It might look something like:
> 
> ```
> unsafe fn mut_buf_with_min_cap(str: &mut nsACString, min_cap: usize) -> Result<&mut [u8], ()> {
>     str.fallible_set_capacity(min_cap)?;
>     // Just poke at the struct from rust now, 'cause we can.
>     let cap = str.capacity();
>     let this = &mut nsCStringRepr = mem::transmute(str);
>     Ok(slice::from_raw_parts_mut(this.data as *mut u8, cap as usize))
> }
> ```
> 
> Then you can just call `set_length` when you're done writing into the buffer?

> ```
> unsafe fn mut_buf_with_min_cap(str: &mut nsACString, min_cap: usize) ->
> Result<&mut [u8], ()> {
>     str.fallible_set_capacity(min_cap)?;
>     // Just poke at the struct from rust now, 'cause we can.
>     let cap = str.capacity();
> ```

The above would be two calls to C++ instead of one. Do we care?

> Then you can just call `set_length` when you're done writing into the buffer?

It seemed questionable to expose a &mut [u8] that's longer than the string's logical length. That's my I made `MaybeExpandCapacity()` set the logical length to capacity and zero-terminate even though the length gets set again soon anyway.

> Please document these macros well. I want explanations as to what each of these macro arguments are intended to represent :-)

OK. Since the conversions ended up having more specialization that I initially expected, I think I'm going to de-macroize the macros that ended up with only one invocation. This will reduce the need for docs.

> It's probably fine to just call the fallible version and .expect() on that. I'm not a big fan of the complexity being added here, and that's what most of the code on the C++ side does anyway.
> 
> I mostly had the separate implementations before on the rust side because they weren't too repetitive, and I figured it might be slightly more efficient.

OK. Thanks.

> The motivation behind this `is_abstract` check is that we can potentially re-use the buffer from the original string? Is that correct?

Yes and, specifically, if we already know it's _not_ abstract, we can start copying during the initial ASCII check instead of having to perform a non-copying ASCII check first.

> This is mildly gross, not the biggest fan of doing direct calls into C++ here.
> 
> I presume that wrapping the adapted pointer into a new nsStringLike doesn't work super well here?

I tried to make a safe-Rust conversion from `Latin1StringLike` to `nsCStringLike` and got into too much of a lifetime mess. As far as I can tell, rewrapping the raw pointer here would be a lifetime cheat, so why would it be less gross than the straight-forward thing I did here? (The lifetime is right in any case, but it would be a lifetime cheat in the sense of not letting the Rust compiler know that.)

> These macros are confusing, especially this one. Please document why these arguments are being used with some comments.
> 
> Also this is definitely over 80 columns wide, could we split it over a few lines?

OK. I haven't run rustfmt on this yet.

> I don't think I understand why you can't use nsCStringLike here. Sure it would allow passing &str when you could append that more efficiently elsewhere, but that doesn't seem like a big deal to me.

This is bit is not about efficiency. &str means UTF-8 string, so it's wrong to pass it to a function that wants a Latin1 string. Obviously, &[u8] might have something other than Latin1 in it, but at least in that case, we aren't treating something that's explicitly non-Latin1 as Latin1.

> The duplication here is unfortunate. I think I'd be comfortable just .expect()-ing on the result of the fallible_maybe_expand_capacity function.

Great. I'll use `expect()`.

> I presume that the intention here is that this type is only implemented on types which don't have a utf-8 constraint?
> 
> IIRC all of these types already implement nsCStringLike, except `&str, String, and Box<str>` happen to also implement nsCStringLike, making that the only difference.

Yes and yes.

> As nsCStringLike should be implemented on all of these types, how about defining a macro to bulk implement this?
> 
> // something like:
> macro_rules! latin1like {
>     ($T:ty) => {
>         impl<'a> Latin1StringLike for $T {
>             fn adapt(&self) -> nsCStringAdapter {
>                 nsCStringLike::adapt(self)
>             }
>         }
>     }
> }

Yeah, this could be macroized.

> I think it would be cleaner to just expose SetCapacity to the rust code as-is, and then perform the other changes which you want to perform in the rust code. Adding this new method to the C++ side and then calling it from rust feels weird to me.

I tried to minimize the number of calls from Rust to C++. Maybe it's not worthwhile to minimize their number?

> I don't think I fully understand this function, and I don't particularially like the name of it. As far as I can tell what this function is trying to do is reserve at least aCapacity bytes, and return a mutable buffer and the actual size of that buffer. The string isn't necessarially in a valid state, and SetLength must be called to set the length correctly before the actual data is written.
> 
> I think I'd prefer exposing SetCapacity() and Capacity() to rust code (as those are useful functions), and then implementing this as a helper function inside of the encoding code, as it probably won't be used elsewhere. I could imagine that not being able to optimize between the SetCapacity and Capacity() calls may not be fast enough though.
> 
> If we need to keep it in C++, perhaps call it MutateWithMinCapacity() or something like that and document that the length must be manually updated once the buffer has been filled in.

> As far as I can tell what this function is trying to do is reserve at least aCapacity bytes, and return a mutable buffer and the actual size of that buffer.

Yes

>    The string isn't necessarially in a valid state, and SetLength must be called to set the length correctly before the actual data is written.

And the function puts the string in a valid state by setting `this->mLength` and zero-terminating. Indeed, this soon gets overwritten with a `SetLength` call, but it seemed wrong to leave the string in an invalid state. Should I leave it in an invalid state?

Henri Sivonen (:hsivonen)

Assignee

Comment 21

•

6 years ago

> I think I'd prefer exposing SetCapacity() and Capacity() to rust code (as those are useful functions), and then implementing this as a helper function inside of the encoding code, as it probably won't be used elsewhere.

I'd like the "stretch buffer to at least this capacity and expose a mutable slice with the actual rounded-up capacity" method to be public so that encoding_glue could move over to using it instead of trying to guess the allocation bucket sizes.

Comment hidden (mozreview-request)

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/214564/diff/2-3/

Henri Sivonen (:hsivonen)

Assignee

Comment 23

•

6 years ago

Due to issues mentioned above, I didn't redo the buffer resizing yet.

Henri Sivonen (:hsivonen)

Assignee

Comment 24

•

6 years ago

Is there a specific naming convention that I should use for the `extern "C"` wrappers for exposing these methods to C++?

Flags: needinfo?(nika)

Henri Sivonen (:hsivonen)

Assignee

Comment 25

•

6 years ago

For e.g. these four methods:
    /// Convert a potentially-invalid UTF-16 string into valid UTF-8
    /// (replacing invalid sequences with the REPLACEMENT CHARACTER) and
    /// replace the content of this string with the conversion result.
    pub fn assign_utf16_to_utf8(&mut self, other: &[u16]) {
        self.fallible_append_utf16_to_utf8_impl(other, 0).expect("Out of memory");
    }

    /// Convert a potentially-invalid UTF-16 string into valid UTF-8
    /// (replacing invalid sequences with the REPLACEMENT CHARACTER) and
    /// fallibly replace the content of this string with the conversion result.
    pub fn fallible_assign_utf16_to_utf8(&mut self, other: &[u16]) -> Result<(), ()> {
        self.fallible_append_utf16_to_utf8_impl(other, 0)
    }

    /// Convert a potentially-invalid UTF-16 string into valid UTF-8
    /// (replacing invalid sequences with the REPLACEMENT CHARACTER) and
    /// append the conversion result to this string.
    pub fn append_utf16_to_utf8(&mut self, other: &[u16]) {
        let len = self.len();
        self.fallible_append_utf16_to_utf8_impl(other, len).expect("Out of memory");
    }

    /// Convert a potentially-invalid UTF-16 string into valid UTF-8
    /// (replacing invalid sequences with the REPLACEMENT CHARACTER) and
    /// fallibly append the conversion result to this string.
    pub fn fallible_append_utf16_to_utf8(&mut self, other: &[u16]) -> Result<(), ()> {
        let len = self.len();
        self.fallible_append_utf16_to_utf8_impl(other, len)
    }

I think I should expose just fallible_append_utf16_to_utf8_impl to C++ and re-create the four-way wrapping there (fallible vs. infallible x assign vs. append).

Nika Layzell [:nika] (ni? for response)

Comment 26

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) (Away from Bugzilla until 2018-01-30) from comment #24)
> Is there a specific naming convention that I should use for the `extern "C"`
> wrappers for exposing these methods to C++?

Not exactly. Servo uses `Servo_TheMethodInCamelCase` and webrender uses `wr_the_name_in_snake_case`. For calling into the C++ side of nsstring I've been following the servo convention with Gecko_BLAH - perhaps try `Rust_BLAH`?

(In reply to Henri Sivonen (:hsivonen) (Away from Bugzilla until 2018-01-30) from comment #25)
> I think I should expose just fallible_append_utf16_to_utf8_impl to C++ and
> re-create the four-way wrapping there (fallible vs. infallible x assign vs.
> append).

I agree

Flags: needinfo?(nika)

Henri Sivonen (:hsivonen)

Assignee

Comment 27

•

6 years ago

(In reply to Nika Layzell [:mystor] from comment #26)
> (In reply to Henri Sivonen (:hsivonen) (Away from Bugzilla until 2018-01-30)
> from comment #24)
> > Is there a specific naming convention that I should use for the `extern "C"`
> > wrappers for exposing these methods to C++?
> 
> Not exactly. Servo uses `Servo_TheMethodInCamelCase` and webrender uses
> `wr_the_name_in_snake_case`. For calling into the C++ side of nsstring I've
> been following the servo convention with Gecko_BLAH - perhaps try
> `Rust_BLAH`?

If WebRender is allowed to use snake case, it seems to me that nstring_blah_blah would avoid bothering casing lints.

Henri Sivonen (:hsivonen)

Assignee

Comment 28

•

6 years ago

I meant nsstring_blah_blah.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/217884/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/217884/

Comment hidden (mozreview-request)

Comment on attachment 8948435 [details]
Bug 1402247 part 3 - Call Rust-based append/assign code from nsReadableUtils.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/217884/diff/1-2/

Henri Sivonen (:hsivonen)

Assignee

Comment 31

•

6 years ago

It seems that our codebase has some remains of support for strings whose buffer isn't contiguous. It also seems that those appear in conversions exceptionally rarely, so I'm going ahead and requiring the conversion to use contiguous inputs.

Henri Sivonen (:hsivonen)

Assignee

Comment 32

•

6 years ago

It turns out that bug 586838 has landed in January, so I guess I now need to get encoding_rs to use NEON on ARMv7 in order not to regress performance there...

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/218084/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/218084/

Code Review Bot [:reviewbot]

Comment 34

•

6 years ago

mozreview-review

Comment on attachment 8948690 [details]
Bug 1402247 part 4 - Use encoding_rs::mem for string conversion that yield plain C strings.

https://reviewboard.mozilla.org/r/218084/#review223914


Code analysis found 5 defects in this patch:
 - 5 defects found by clang-tidy

You can run this analysis locally with:
 - `./mach static-analysis check path/to/file.cpp` (C/C++)


If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx


::: xpcom/string/nsReadableUtils.cpp:42
(Diff revision 1)
>  
>  
>  char*
>  ToNewCString(const nsAString& aSource)
>  {
> -  char* result = AllocateStringCopy(aSource, (char*)0);
> +  char* dest = AllocateStringCopy(aSource, (char*)0);

Warning: Use nullptr [clang-tidy: modernize-use-nullptr]

  char* dest = AllocateStringCopy(aSource, (char*)0);
                                                  ^
                                                  nullptr

::: xpcom/string/nsReadableUtils.cpp:92
(Diff revision 1)
>  char*
>  ToNewCString(const nsACString& aSource)
>  {
>    // no conversion needed, just allocate a buffer of the correct length and copy into it
>  
> -  char* result = AllocateStringCopy(aSource, (char*)0);
> +  char* dest = AllocateStringCopy(aSource, (char*)0);

Warning: Use nullptr [clang-tidy: modernize-use-nullptr]

  char* dest = AllocateStringCopy(aSource, (char*)0);
                                                  ^
                                                  nullptr

::: xpcom/string/nsReadableUtils.cpp:108
(Diff revision 1)
>  char16_t*
>  ToNewUnicode(const nsAString& aSource)
>  {
>    // no conversion needed, just allocate a buffer of the correct length and copy into it
>  
> -  char16_t* result = AllocateStringCopy(aSource, (char16_t*)0);
> +  char16_t* dest = AllocateStringCopy(aSource, (char16_t*)0);

Warning: Use nullptr [clang-tidy: modernize-use-nullptr]

  char16_t* dest = AllocateStringCopy(aSource, (char16_t*)0);
                                                          ^
                                                          nullptr

::: xpcom/string/nsReadableUtils.cpp:122
(Diff revision 1)
>  }
>  
>  char16_t*
>  ToNewUnicode(const nsACString& aSource)
>  {
> -  char16_t* result = AllocateStringCopy(aSource, (char16_t*)0);
> +  char16_t* dest = AllocateStringCopy(aSource, (char16_t*)0);

Warning: Use nullptr [clang-tidy: modernize-use-nullptr]

  char16_t* dest = AllocateStringCopy(aSource, (char16_t*)0);
                                                          ^
                                                          nullptr

::: xpcom/string/nsReadableUtils.cpp:161
(Diff revision 1)
>  }
>  
>  char16_t*
>  UTF8ToNewUnicode(const nsACString& aSource, uint32_t* aUTF16Count)
>  {
> -  const uint32_t length = CalcUTF8ToUnicodeLength(aSource);
> +  char16_t* dest = AllocateStringCopy(aSource, (char16_t*)0);

Warning: Use nullptr [clang-tidy: modernize-use-nullptr]

  char16_t* dest = AllocateStringCopy(aSource, (char16_t*)0);
                                                          ^
                                                          nullptr

Comment hidden (mozreview-request)

Comment on attachment 8948690 [details]
Bug 1402247 part 4 - Use encoding_rs::mem for string conversion that yield plain C strings.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/218084/diff/1-2/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1436330

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/214564/diff/3-4/

Comment hidden (mozreview-request)

Comment on attachment 8948435 [details]
Bug 1402247 part 3 - Call Rust-based append/assign code from nsReadableUtils.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/217884/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8948690 [details]
Bug 1402247 part 4 - Use encoding_rs::mem for string conversion that yield plain C strings.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/218084/diff/2-3/

Comment hidden (mozreview-request)

This accomplishes two things:
 1) For types that provide operator overloading conversion to Span and to
    const char*/const char16_t*, the conversion ambiguity is removed.
 2) Requiring the use of MakeStringSpan() if the caller doesn't have the
    length makes the cost more explicit and also discourages the use of
    strings that don't know their length.

Review commit: https://reviewboard.mozilla.org/r/218714/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/218714/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/3-4/

Comment hidden (mozreview-request)

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/214564/diff/4-5/

Comment hidden (mozreview-request)

Comment on attachment 8948435 [details]
Bug 1402247 part 3 - Call Rust-based append/assign code from nsReadableUtils.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/217884/diff/3-4/

Comment hidden (mozreview-request)

Comment on attachment 8948690 [details]
Bug 1402247 part 4 - Use encoding_rs::mem for string conversion that yield plain C strings.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/218084/diff/3-4/

Comment hidden (mozreview-request)

Comment on attachment 8949345 [details]
Bug 1402247 part 5 - Remove const char* / const char16_t versions of string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/218714/diff/1-2/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/224298/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/224298/

Comment hidden (mozreview-request)

Comment on attachment 8955143 [details]
Bug 1402247 part 6 - Fix fatal bugs.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/224298/diff/1-2/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1445692

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/227766/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/227766/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1445936

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/228066/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/228066/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/228068/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/228068/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/4-5/

Comment hidden (mozreview-request)

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/214564/diff/5-6/

Comment hidden (mozreview-request)

Comment on attachment 8948435 [details]
Bug 1402247 part 3 - Call Rust-based append/assign code from nsReadableUtils.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/217884/diff/4-5/

Comment hidden (mozreview-request)

Comment on attachment 8948690 [details]
Bug 1402247 part 4 - Use encoding_rs::mem for string conversion that yield plain C strings.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/218084/diff/4-5/

Comment hidden (mozreview-request)

Comment on attachment 8949345 [details]
Bug 1402247 part 5 - Remove const char* / const char16_t versions of string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/218714/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8955143 [details]
Bug 1402247 part 6 - Fix fatal bugs.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/224298/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8958890 [details]
Bug 1402247 part 7 - Use the new code from XPConnect.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/227766/diff/1-2/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/228106/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/228106/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/228886/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/228886/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1423877

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/229426/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/229426/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/229696/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/229696/

Comment hidden (mozreview-request)

Comment on attachment 8960945 [details]
Bug 1402247 part 11 - Use U+FFFD when atomizing invalid UTF-8.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/229696/diff/1-2/

Henri Sivonen (:hsivonen)

Assignee

Comment 63

•

6 years ago

mozreview-review-reply

Comment on attachment 8944345 [details]
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust.

https://reviewboard.mozilla.org/r/214564/#review220970

> OK. (Looking forward ? converting None to Err(()) sometime in the future.)

This will be fixed in the next patch upload.

> > ```
> > unsafe fn mut_buf_with_min_cap(str: &mut nsACString, min_cap: usize) ->
> > Result<&mut [u8], ()> {
> >     str.fallible_set_capacity(min_cap)?;
> >     // Just poke at the struct from rust now, 'cause we can.
> >     let cap = str.capacity();
> > ```
> 
> The above would be two calls to C++ instead of one. Do we care?
> 
> > Then you can just call `set_length` when you're done writing into the buffer?
> 
> It seemed questionable to expose a &mut [u8] that's longer than the string's logical length. That's my I made `MaybeExpandCapacity()` set the logical length to capacity and zero-terminate even though the length gets set again soon anyway.

I'm still scared of exposing C++ APIs that leave the string in an invalid state and expect the Rust side to clean up.

> OK. Since the conversions ended up having more specialization that I initially expected, I think I'm going to de-macroize the macros that ended up with only one invocation. This will reduce the need for docs.

Will be documented in the next patch upload.

> OK. Thanks.

Already fixed.

> OK. I haven't run rustfmt on this yet.

The next patch upload will have been through rustfmt.

> Yeah, this could be macroized.

I think I've done what was requested.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/5-6/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1448755

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/6-7/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1448772

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1448786

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/7-8/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/8-9/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/9-10/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/10-11/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/11-12/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/12-13/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/13-14/

Henri Sivonen (:hsivonen)

Assignee

Comment 73

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #20)
> >    The string isn't necessarially in a valid state, and SetLength must be called to set the length correctly before the actual data is written.
> 
> And the function puts the string in a valid state by setting `this->mLength`
> and zero-terminating. Indeed, this soon gets overwritten with a `SetLength`
> call, but it seemed wrong to leave the string in an invalid state. Should I
> leave it in an invalid state?

After seeing that the fixed cost of the new code is higher than the fixed cost of the old code, I'm now more keen on trying tricks like making the capacity expansion on the C++ leave string in an invalid state and then putting it in a valid state by writing the length from Rust without FFI overhead and doing the zero termination in Rust after writing (as opposed to doing a cache-bothering useless zero-termination ahead of time).

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/234896/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/234896/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/14-15/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/251390/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/251390/

Comment hidden (mozreview-request)

Comment on attachment 8985898 [details]
Bug 1402247 bis - Rework string buffer allocation from Rust.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/251390/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/15-16/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Attachment #8966164 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Attachment #8985898 - Attachment is obsolete: true

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/16-17/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/17-18/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/18-19/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1397807

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/19-20/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1469512

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/20-21/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/21-22/

Henri Sivonen (:hsivonen)

Assignee

Comment 85

•

6 years ago

Greenish try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7127052a4fae881669e48f0dfc4cc01ea56ba25f

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/22-23/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/23-24/

Henri Sivonen (:hsivonen)

Assignee

Comment 88

•

6 years ago

nsTSubstring.h doesn't want to be included from nsWindowsWMain.cpp. See:
https://searchfox.org/mozilla-central/source/xpcom/string/nsTSubstring.h#21
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d558872b0d6f91175c676d46265587463a897c87&selectedJob=183937397

Is that #error still relevant these days? If it is, do I guess correctly that Rust code from gkrust isn't callable, either? Should I call a Windows system API for UTF-16 to UTF-8 conversion there?

Flags: needinfo?(mh+mozilla)

Nathan Froyd [:froydnj]

Comment 89

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #88)
> nsTSubstring.h doesn't want to be included from nsWindowsWMain.cpp. See:
> https://searchfox.org/mozilla-central/source/xpcom/string/nsTSubstring.h#21
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d558872b0d6f91175c676d46265587463a897c87&selectedJob=1
> 83937397
> 
> Is that #error still relevant these days? If it is, do I guess correctly
> that Rust code from gkrust isn't callable, either? Should I call a Windows
> system API for UTF-16 to UTF-8 conversion there?

Yes to all three of these.

Flags: needinfo?(mh+mozilla)

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/24-25/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/25-26/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/26-27/

Henri Sivonen (:hsivonen)

Assignee

Comment 93

•

6 years ago

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=224fbf677665&newProject=try&newRevision=49267a336b7cfc3a27a62729edf36862d9d514ab&framework=6

The perf results look good. Stuff gets better, except non-Latin UTF-16 to UTF-8 gets worse, but that's expected as a tradeoff of making the UTF-16 to UTF-8 ASCII case go a lot faster and I expect the other conversion pairs to be more important for non-ASCII going forward.

That said, I'll try to see if I can make non-Latin UTF-16 to UTF-8 better without losing the ASCII UTF-16 to UTF-8 wins.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/27-28/

Henri Sivonen (:hsivonen)

Assignee

Comment 95

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #93)
> That said, I'll try to see if I can make non-Latin UTF-16 to UTF-8 better
> without losing the ASCII UTF-16 to UTF-8 wins.

Bug 1470131 should provide a substantial improvement.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/28-29/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/29-30/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/30-31/

Henri Sivonen (:hsivonen)

Assignee

Comment 99

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b012f12e95a83b876abd64819a85b52a1869a925 is orange in seemingly random and locally non-reproducible ways, which suggests memory corruption. Yet, AFAICT, apart from rebasing this patch has only changed in Windows-specific ways since the greenish run in comment 85.

I suspect a bad automerge with bug 1447951, but I don't see anything obviously wrong.

I'm going to try going back to the base revision of the try run from comment 85 and seeing if the orange goes away.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/31-32/

Henri Sivonen (:hsivonen)

Assignee

Comment 101

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #99)
> I'm going to try going back to the base revision of the try run from comment
> 85 and seeing if the orange goes away.

The orange went away when going back to the old base revision:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0214733d942a05f4becdc04148f5646166e30e28

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/32-33/

Henri Sivonen (:hsivonen)

Assignee

Comment 103

•

6 years ago

The changeset right before bug 1447951 was green:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=6c29e4fc89c79a9a710eaa349f138cc84115ffd8

This patch rebased on top of that changeset:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c977ef982e637f90f967a79b69e9666eb0e56ebf

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/33-34/

Henri Sivonen (:hsivonen)

Assignee

Comment 105

•

6 years ago

Proper rebase:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c302d39bac1b0e034e076af916c47db4deb6aa9

So things went orange before bug 1447951.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/34-35/

Henri Sivonen (:hsivonen)

Assignee

Comment 107

•

6 years ago

Trying the revision immediately before encoding_rs 0.8.1 as the base:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=22d77500510c224fd2b55c2e7a701949fd8f66e3

Henri Sivonen (:hsivonen)

Assignee

Comment 108

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #107)
> Trying the revision immediately before encoding_rs 0.8.1 as the base:

With less infra failure:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eac914961fcfedfcbef1e20d7e5775f6fffcc0fc

Henri Sivonen (:hsivonen)

Assignee

Comment 109

•

6 years ago

Oops. I had not been fuzzing the revision I thought I had been fuzzing. encoding_rs 0.8.1 introduced bogus code.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1471533

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/35-36/

Henri Sivonen (:hsivonen)

Assignee

Comment 111

•

6 years ago

Mochitest 3 failure is a base revision bug:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=33d5b51d31be4659047e4106fcab07dbeca1b7ba

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Attachment #8943599 - Flags: review?(nika)

Attachment #8943599 - Flags: review?(nfroyd)

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/36-37/

Henri Sivonen (:hsivonen)

Assignee

Comment 113

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e14979195e605fa7e3ff63b2a8c48bdba9913883

Nathan Froyd [:froydnj]

Comment 114

•

6 years ago

mozreview-review

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review260020

This is, uh, a large patch.  Preliminary comments below.

::: testing/web-platform/meta/mimesniff/mime-types/parsing.any.js.ini:3702
(Diff revision 37)
> +  [ÿ/ÿ (Request/Response)]
> +    expected: FAIL
>  
> -[parsing.any.html]
> +  [/x (Request/Response)]
> -  max-asserts: 2469
> -  [TEXT/HTML;CHARSET=GBK (Blob/File)]

What in the world do all these wpt changes represent?

::: xpcom/base/CycleCollectedJSRuntime.cpp:1639
(Diff revision 37)
>    nsAutoCString stack;
>    JS::UniqueChars buf = JS::FormatStackDump(cx, nullptr, /* showArgs = */ false, /* showLocals = */ false, /* showThisProps = */ false);
>    stack.Append(buf.get());

Apparently `stack` is just dead code here?  I'm going to fix this in another bug.

::: xpcom/string/nsReadableUtils.cpp
(Diff revision 37)
> -void
> -CopyASCIItoUTF16(const char* aSource, nsAString& aDest)

What is the rationale for getting rid of all these `char*` and similar overloads?  All such changes seem to be doing is necessitating a lot of churn in the callers.

::: xpcom/string/nsTSubstring.h:903
(Diff revision 37)
> +  /**
> +   * Prepares mData to be mutated such that the capacity of the string
> +   * (not counting the zero-terminator) is at least aCapacity.
> +   * Returns the actual capacity, which may be larger than what was
> +   * requested or UINT32_MAX on allocation failure.
> +   *
> +   * mLength is ignored by this method. If the buffer is reallocated,
> +   * aUnitsToPreserve specifies how many code units to copy over to
> +   * the new buffer. The old buffer is freed if applicable.
> +   *
> +   * Unless the return value is UINT32_MAX to signal failure, this
> +   * method the string in an invalid state! The caller is reponsible
> +   * for zero-terminating the string and setting a valid mLength
> +   * by calling FinishBulkWrite(). This method sets the flag to claim
> +   * that the string is zero-terminated before it actually is.
> +   *
> +   * Once this method has been called and before FinishBulkWrite()
> +   * has been called, only calls to Data() or this method again
> +   * are valid. Do not call any other methods between calling this
> +   * method and FinishBulkWrite().
> +   */
> +  uint32_t NS_FASTCALL StartBulkWrite(size_type aCapacity, size_type aUnitsToPreserve);
> +
> +  /**
> +   * Restores the string to a valid state after a call to BulkWritePrep()
> +   * that returned a non-UINT32_MAX value. The argument to this method
> +   * must be less than or equal to the non-UINT32_MAX value returned by
> +   * the most recent BulkWritePrep() call.
> +   */
> +  inline void FinishBulkWrite(size_type aLength) {
> +    MOZ_ASSERT(aLength != UINT32_MAX, "OOM magic value passed as length.");
> +    base_string_type::mData[aLength] = 0;
> +    base_string_type::mLength = aLength;
> +    AssertValid();
> +  }

These are helpful documentation comments, thank you for writing them.  But I don't understand why these methods are useful/needed vs. other methods that exist in the public API or non-priate APIs like `MutatePrep`, nor do I see any justification for them in commit comments or this documentation.  What are these buying you over and above the methods that already exist?

::: xpcom/string/nsTSubstring.h:927
(Diff revision 37)
> +   * Restores the string to a valid state after a call to BulkWritePrep()
> +   * that returned a non-UINT32_MAX value. The argument to this method
> +   * must be less than or equal to the non-UINT32_MAX value returned by
> +   * the most recent BulkWritePrep() call.

This comment should be referencing `StartBulkWrite` instead of `BulkWritePrep`?

::: xpcom/string/nsTSubstring.cpp:52
(Diff revision 37)
>    return static_cast<const nsTAutoString<T>*>(aStr);
>  }
>  
> +template <typename T>
> +uint32_t
> +nsTSubstring<T>::StartBulkWrite(size_type aCapacity, size_type aUnitsToPreserve)

Given the above comment on the documentation, and given that this duplicates significant amounts of `MutatePrep`, I think this API needs to be refactored.

Henri Sivonen (:hsivonen)

Assignee

Comment 115

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review260020

> What in the world do all these wpt changes represent?

These are MIME type parsing tests that before this patch pass for the wrong reason. (The questionable failure mode of the old UTF-8 to UTF-16 conversion code happened to look like the pass condition.) In bug 1423877 comment 5 annevk OKed marking these as expected failures.

> What is the rationale for getting rid of all these `char*` and similar overloads?  All such changes seem to be doing is necessitating a lot of churn in the callers.

The original rationale is avoiding some overload ambiguity error. But regardless of that, I think it makes sense not to hide the `strlen` from the callers and make call sites that don't have the length have to take an action to find out the length (by calling `MakeStringSpan`). The cost of hidden `strlen` in hard to quantify, but it can't be good for our code. Hopefully this change discourages the use of string that don't know their length.

It's a bit unfortunate that this change makes this patch look large even though most of the changes arise from this detail.

> These are helpful documentation comments, thank you for writing them.  But I don't understand why these methods are useful/needed vs. other methods that exist in the public API or non-priate APIs like `MutatePrep`, nor do I see any justification for them in commit comments or this documentation.  What are these buying you over and above the methods that already exist?

`MutatePrep` does things I wanted to avoid. I wanted to get stuff up and running before refactoring these back together, but I should have refactored them back together before requesting review. Sorry.

Henri Sivonen (:hsivonen)

Assignee

Comment 116

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review260020

> `MutatePrep` does things I wanted to avoid. I wanted to get stuff up and running before refactoring these back together, but I should have refactored them back together before requesting review. Sorry.

Specifically, `MutatePrep` may copy more data than is useful via `nsStringBuffer::Realloc`. `StartBulkWrite` copies only as much old data as is requested. Additionally, it avoids having to make a useless write of the zero terminator that will be shortly rewritten anyway (and the other write will be nicer in terms of cache access pattern, too).

Except for `ReplacePrepInternal`, the two other existing callers of `MutatePrep` could be migrated over to `StartBulkWrite` and `FinishBulkWrite`. Maybe I should add some more arguments to `StartBulkWrite` to make it capable of doing the things that `ReplacePrepInternal` wants to do.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/37-38/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/38-39/

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Blocks: 1472113

Henri Sivonen (:hsivonen)

Assignee

Comment 119

•

6 years ago

The current semantics of SetCapacity() are just awful. Filed bug 1472113.

Henri Sivonen (:hsivonen)

Assignee

Comment 120

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review260020

> Given the above comment on the documentation, and given that this duplicates significant amounts of `MutatePrep`, I think this API needs to be refactored.

Refactored. The patch is now even larger, but at least in the process I discovered that we could make SetCapacity() memcpy less if we audited the callers to behave more reasonably.

Henri Sivonen (:hsivonen)

Assignee

Comment 121

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review260020

> Specifically, `MutatePrep` may copy more data than is useful via `nsStringBuffer::Realloc`. `StartBulkWrite` copies only as much old data as is requested. Additionally, it avoids having to make a useless write of the zero terminator that will be shortly rewritten anyway (and the other write will be nicer in terms of cache access pattern, too).
> 
> Except for `ReplacePrepInternal`, the two other existing callers of `MutatePrep` could be migrated over to `StartBulkWrite` and `FinishBulkWrite`. Maybe I should add some more arguments to `StartBulkWrite` to make it capable of doing the things that `ReplacePrepInternal` wants to do.

Refactored as planned in the previous comment.

Henri Sivonen (:hsivonen)

Assignee

Comment 122

•

6 years ago

Linux try run looks good:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d48371e71f69041fe94b52eca583935ee076e7e

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/39-40/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/40-41/

Henri Sivonen (:hsivonen)

Assignee

Comment 125

•

6 years ago

Unsurprisingly, the latest allocation strategy is not great for non-ASCII Latin that starts with an ASCII code point. Maybe instead of checking the first code unit, the opportunistic ASCIIness guess should look at the first 16 code units using SIMD.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/41-42/

Henri Sivonen (:hsivonen)

Assignee

Comment 127

•

6 years ago

This was the bad strategy:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=e9e6b3bfa18f7f39fcf2c083a6b22770488a709f&framework=6&filter=hundred&selectedTimeRange=172800

Try run with Basic Latin checking for up to two cache lines when converting from UTF-16 to UTF-8:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=63552be9b4f7507bf109e251f3f5a3244218874c

I think I should also add a threshold so that the buffer shrinking doesn't happen if the difference between the current buffer size and the size it could be shrunk to is too small bother. I suspect a threshold should is somewhere above 200 bytes but below 1000 bytes. With power-of-two buckets, that means either between the 256 and 512 buckets or between the 512 and 1024 buckets. I'm going to guess between the 256 and 512 buckets.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/42-43/

Henri Sivonen (:hsivonen)

Assignee

Comment 129

•

6 years ago

Introduce a threshold for shrinking head-allocated buffers:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc2010d4cb866351a4e43220973d0d0a12d33e74

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/43-44/

Henri Sivonen (:hsivonen)

Assignee

Comment 131

•

6 years ago

Compute cache lines properly for UTF-16:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=228ae8a6cf2f69847ce378167341bae974ea11e7

- -

This bug has gone on for a large number of comments, so I figured it might be good to write down what this actually accomplishes.

Correctness:

* UTF errors are handled safely per spec instead of dangerously truncating strings. (Even some OOM cases take care to turn strings into safe single REPLACEMENT CHARACTER strings instead of making strings empty.)

* There are fewer converter implementations, so if there are correctness bugs, there are fewer places to fix. (C++ UTF-8 code and C++ UTF-16 code wasn't completely eliminated: The code for iterating over UTF-8 by scalar value and the code for iterating over UTF-16 by scalar value in C++ needs to stay as C++ to inline properly. I rewrote the code paths for correctness, though.)

Performance:

* The old code did exact buffer length math, which meant doing UTF math twice on each input string (once for length calculation and another time for conversion). Exact length math is more complicated when handling errors properly, which the old code didn't do. The new code does UTF math on the string content only once (when converting) but risks allocating more than once. There are heuristics in place to lower the probability of reallocation in cases where the double math avoidance isn't enough of a saving to absorb an allocation and memcpy.

* Previously, in UTF-16 <-> UTF-8 conversions, an ASCII prefix was optimized but a single non-ASCII code point pessimized the math for the rest of the string. The new code tries to get back on the fast ASCII path.

* UTF-16 to Latin1 conversion guarantees less about handling of out-of-range input to eliminate an operation from the inner loop on x86/x86_64.

* When assigning to a pre-existing string, the new code tries to reuse the old buffer instead of first releasing the old buffer and then allocating a new one.

* When reallocating from the new code, the memcpy covers only the data that is part of the logical length of the old string instead of memcpying the whole capacity. (For old callers old excess memcpy behavior is preserved due to bogus callers. See bug 1472113.)

Features:

* Conversion between UTF-8 and Latin1 is added in order to enable faster future interop between Rust code (or otherwise UTF-8-using code) and text node and SpiderMonkey code that uses Latin1.

Henri Sivonen (:hsivonen)

Assignee

Comment 132

•

6 years ago

Oh, and under "Performance":

 * UTF-8 strings in XPConnect that are in the Latin1 range are passed to SpiderMonkey as Latin1.

Henri Sivonen (:hsivonen)

Assignee

Comment 133

•

6 years ago

The perf numbers with the latest patch look pretty good. Some short string regressed, but the absolute time for those is quick anyway and the savings for long strings are large.
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=228ae8a6cf2f69847ce378167341bae974ea11e7&framework=6&selectedTimeRange=172800

There's still room to add special casing to make the short strings faster by making the conversion code aware of whether it's working with an autostring destination, but maybe that should be left to a follow-up instead of making this patch ever larger.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Blocks: 1473337

Henri Sivonen (:hsivonen)

Assignee

Comment 134

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #133)
> There's still room to add special casing to make the short strings faster by
> making the conversion code aware of whether it's working with an autostring
> destination, but maybe that should be left to a follow-up instead of making
> this patch ever larger.

Filed as bug 1473337 to keep the reviewable patch stable here.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Blocks: 1473340

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Blocks: 1473342

Nika Layzell [:nika] (ni? for response)

Comment 135

•

6 years ago

mozreview-review

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

Haven't had time to review the XPCOM C++ changes yet at all, but here's some comments on what I've looked at so far :-).

It'd be awesome if this could be split up a bit to make it a touch more digestable & less daunting to look over.

::: xpcom/string/nsReadableUtils.h:268
(Diff revision 2)
>   *
>   * @param aString a 8-bit wide string to scan
>   */
> -inline bool IsASCII(const nsACString& aString)
> +inline bool IsASCII(mozilla::Span<const uint8_t> aString)
>  {
> -  size_t length = aString.Length();
> +  return encoding_mem_is_ascii(aString.Elements(), aString.Length());

I presume here that you've worked around the call overhead pessimization issues?

::: docshell/base/nsDocShell.cpp:4774
(Diff revision 44)
>    nsAutoCString url;
>    if (aURI) {
>      nsresult rv = aURI->GetSpec(url);
>      NS_ENSURE_SUCCESS(rv, rv);
>    } else if (aURL) {
> -    CopyUTF16toUTF8(aURL, url);
> +    CopyUTF16toUTF8(MakeStringSpan(aURL), url);

It would be really nice if all of these API usage changes could be moved into a separate patch. It would make it much easier to review :-).

::: js/xpconnect/src/XPCConvert.cpp:282
(Diff revision 44)
>              return true;
>          }
>  
> -        if (utf8String->IsEmpty()) {
> +        uint32_t len = utf8String->Length();
> +
> +        if (!len) {

Please explicitly compare the length to 0 here :-)

::: js/xpconnect/src/XPCConvert.cpp:287
(Diff revision 44)
> -        const uint32_t len = CalcUTF8ToUnicodeLength(*utf8String);
> -        // The cString is not empty at this point, but the calculated
> +        CheckedInt<uint32_t> allocLen(len);
> +        allocLen += 1;

This might be more clear as:

auto allocLen = CheckedUint32(len) + 1;

::: js/xpconnect/src/XPCConvert.cpp:305
(Diff revision 44)
> -            // Copy or conversion during copy failed. Did not copy the
> -            // whole string.
> +            if (!buffer) {
> +                return false;
> +            }
> +            size_t written = LossyConvertUTF8toLatin1(*utf8String, MakeSpan(buffer, len));
> +            buffer[written] = 0;
> +            // JS_NewLatin1String takes ownership on success, i.e. a

nit: newline before this comment.

::: js/xpconnect/src/XPCConvert.cpp:318
(Diff revision 44)
> -        }
> +            }
> +            d.setString(str);
> +            return true;
> +        }
> +
> +        allocLen *= sizeof(char16_t);

Perhaps add a comment here explaining that at most one char16_t will be required per char.

I think it's pretty obvious but it might be nice to clarify the reasoning.

::: js/xpconnect/src/XPCConvert.cpp:332
(Diff revision 44)
> +        if (!buffer) {
> +            return false;
> +        }
> +        size_t written = ConvertUTF8toUTF16(*utf8String, MakeSpan(buffer, len + 1));
> +        buffer[written] = 0;
> +        // JS_NewUCStringDontDeflate takes ownership on success, i.e. a

nit: newline before this comment.

::: js/xpconnect/src/nsXPConnect.cpp:197
(Diff revision 44)
>  {
>      if (!aReport->filename)
>          mFileName.SetIsVoid(true);
>      else
> -        CopyASCIItoUTF16(aReport->filename, mFileName);
> +        CopyASCIItoUTF16(mozilla::MakeStringSpan(aReport->filename), mFileName);
> +        // XXX should the above use CopyUTF8toUTF16 instead?

I don't think it's necessary. The filenames for xpconnect errors should all be chrome files, which always have ASCII filenames IIRC.

That being said, it wouldn't be a big deal to use UTF8 instead of ASCII as well *shrug*

::: netwerk/protocol/http/nsHttpBasicAuth.cpp:81
(Diff revision 44)
>      NS_ENSURE_TRUE(isBasicAuth, NS_ERROR_UNEXPECTED);
>  
>      // we work with UTF-8 around here
>      nsAutoCString userpass;
> -    CopyUTF16toUTF8(user, userpass);
> +    if (user) {
> +      CopyUTF16toUTF8(mozilla::MakeStringSpan(user), userpass);

MakeStringSpan should handle null pointers just fine, so I think you can dodge this check.

::: security/manager/ssl/nsNSSCertHelper.cpp:912
(Diff revision 44)
>      (unsigned char*)PORT_ArenaZAlloc(arena.get(), utf8ValLen);
>    if (!PORT_UCS2_UTF8Conversion(
>          false, data, len, utf8Val, utf8ValLen, &utf8ValLen)) {
>      return NS_ERROR_FAILURE;
>    }
> -  AppendUTF8toUTF16((char*)utf8Val, text);
> +  AppendUTF8toUTF16(MakeSpan(reinterpret_cast<char*>(utf8Val), utf8ValLen), text);

Would it make sense for us to accept both signed and unsigned char spans in the AppendUTF8toUTF16 method? I don't think it should make a difference to us, and this reinterpret cast is pretty ugly :-/

::: servo/support/gecko/nsstring/src/conversions.rs:34
(Diff revision 44)
> +#[inline(always)]
> +fn plus_one(a: usize) -> Option<usize> {
> +    a.checked_add(1)
> +}
> +
> +const CACHE_LINE: usize = 64;

Please add a comment explaining where this number came from.

::: servo/support/gecko/nsstring/src/conversions.rs:36
(Diff revision 44)
> +    a.checked_add(1)
> +}
> +
> +const CACHE_LINE: usize = 64;
> +
> +const CACHE_LINE_MASK: usize = 63;

Could we just define this as CACHE_LINE - 1?

::: servo/support/gecko/nsstring/src/conversions.rs:51
(Diff revision 44)
> +    is_ascii(&buffer[..bound])
> +}
> +
> +#[inline(always)]
> +fn starts_with_basic_latin(buffer: &[u16]) -> bool {
> +    let len = buffer.len();

I'm pretty sure that caching .len() is completely unnecessary here. The function call will always be inlined & is a very simple cast/read.

Perhaps just use 'buffer.len()' in the places where you mention 'len'?

::: servo/support/gecko/nsstring/src/conversions.rs:61
(Diff revision 44)
> +/// A conversion where the number of code units in the output is potentially
> +/// smaller than the number of code units in the input.

Could you have a comment somewhere explaining what a "conversion" function should look like?

::: servo/support/gecko/nsstring/src/conversions.rs:72
(Diff revision 44)
> +    ($name:ident,
> +     $convert:ident,
> +     $other_ty:ty,
> +     $math:ident) => (

nit: Could we make this macro also include the names of the parameters so it's easier to see what's being passed, e.g:


    (name = $name:ident,
     convert = $convert:ident,
     other_ty = $other_ty:ty,
     math = $math:ident) => { ... }
     
It would make the callsites more clear :-)

::: servo/support/gecko/nsstring/src/conversions.rs:275
(Diff revision 44)
> +        other: &[u16],
> +        old_len: usize,
> +    ) -> Result<BulkWriteOk, ()> {
> +        // We first size the buffer for ASCII if the first cache line is ASCII. If that turns out not to
> +        // be enough, we size for the worst case given the length of the remaining input at that point.
> +        let (filled, num_ascii, mut handle) = if starts_with_basic_latin(other) {

I'm curious why you check for latin1 rather than ASCII here. Is the logic for checking if the first cacheline is basic latin faster than checking if it's valid ASCII?

::: servo/support/gecko/nsstring/src/conversions.rs:606
(Diff revision 44)
> +    this: *mut nsAString,
> +    other: *const u8,
> +    other_len: usize,
> +    old_len: usize,
> +) -> bool {
> +    let other_slice = ::std::slice::from_raw_parts(other, other_len);

nit: please add a `use std::slice` and just write `slice::from_raw_parts` here :-)

::: servo/support/gecko/nsstring/src/lib.rs:144
(Diff revision 44)
> +
> +/// A type for showing that `finish()` was called on a `BulkWriteHandle`.
> +/// Instantiating this type from elsewhere is basically an assertion that
> +/// there is no `BulkWriteHandle` around, so be very careful with instantiating
> +/// this type!
> +pub struct BulkWriteOk {

You can leave off the { } and just write:

    pub struct BulkWriteOk;

::: servo/support/gecko/nsstring/src/lib.rs:269
(Diff revision 44)
> +    /// calls `forget(self)`.
> +    fn drop(&mut self) {
> +        if self.capacity == 0 {
> +            return;
> +        }
> +        // Zero-length strings are special, so we can't make one

You mention that zero-length strings are special. That being said, I think we could probably get away with truncate()-ing the string in this case just fine, as that just frees whatever backing buffer was in use, and then switches it to point at the singleton static empty string.

::: servo/support/gecko/nsstring/src/lib.rs:361
(Diff revision 44)
>          /// This struct will leak its data if dropped from rust. See the module
>          /// documentation for more information on this type.
>          #[repr(C)]
>          #[derive(Debug)]
>          pub struct $StringRepr {
> -            data: *const $char_t,
> +            data: *mut $char_t,

Don't bother switching this to a `*mut`. We can just  perform an `as` cast to get the data as a `*mut` if we need it to be one.

BTW, now that `std::ptr::NonNull` is stable (since 1.25), we could actually switch this over to a `NonNull<$char_t>` which is covariant and provides some handy `as_ref` `as_mut` and `as_ptr` methods :-).

It'd also mean we'd get the null variant optimization in `Option<ns[C]String>`, which is kinda cool.

::: servo/support/gecko/nsstring/src/lib.rs:371
(Diff revision 44)
>  
>          impl $StringRepr {
>              fn new(classflags: ClassFlags) -> $StringRepr {
>                  static NUL: $char_t = 0;
>                  $StringRepr {
> -                    data: &NUL,
> +                    data: unsafe { mem::transmute(&NUL as *const $char_t) },

You can cast a `*const $char_t` to a `*mut $char_t` with  `as`, for example:

    data: &NUL as *const _ as *mut _,

::: servo/support/gecko/nsstring/src/lib.rs:406
(Diff revision 44)
> +        impl<'a> $BulkWriteHandle<'a> {
> +            fn new(string: &'a mut $AString, capacity: usize) -> Self {
> +                $BulkWriteHandle{ string: string, capacity: capacity }
> +            }
> +
> +            pub unsafe fn restart_bulk_write(&mut self, capacity: usize, units_to_preserve: usize, allow_shrinking: bool) -> Result<(), ()> {

nit: line length here and everywhere :-)

::: servo/support/gecko/nsstring/src/lib.rs:417
(Diff revision 44)
> +                if allow_shrinking {
> +                    unsafe {
> +                        let _ = self.restart_bulk_write(length, length, true);
> +                    }
> +                }
> +                // NOTE: Drop is implemented outside the macro earlier in this file

If we switch `Drop` to `truncate()`, we should be able to get away with defining it in the macro :-).

::: servo/support/gecko/nsstring/src/lib.rs:421
(Diff revision 44)
> +                        // The zero-argument case is OK even when the string
> +                        // is in invalid state.
> +                        self.string.set_length(0);

I think just calling `self.string.truncate()` will be more clear here.

::: servo/support/gecko/nsstring/src/lib.rs:431
(Diff revision 44)
> +                    let string_ptr = self.string as *mut $AString;
> +                    let this: *mut $StringRepr = mem::transmute(string_ptr);

let this = `self.string as *mut _ as *mut  $StringRepr` will work here and avoid the `transmute`.

This could also probably be moved into a private method on `nsA[C]String` called something along the lines of `fn as_repr(&self) -> NotNull<$StringRepr>`.

We could use it in `impl Deref for $AString` as well.

::: servo/support/gecko/nsstring/src/lib.rs:444
(Diff revision 44)
> +
> +            pub fn as_mut_slice(&mut self) -> &mut [$char_t] {
> +                unsafe {
> +                    // The borrow checker doesn't allow a reference to reference
> +                    // transmute, so let's use pointers.
> +                    let string_ptr = self.string as *mut $AString;

Again, could be greatly simplified by a shared helper method on `$AString`, and the transmute is unnecessary.

::: servo/support/gecko/nsstring/src/lib.rs:448
(Diff revision 44)
> +                    // transmute, so let's use pointers.
> +                    let string_ptr = self.string as *mut $AString;
> +                    let this: *mut $StringRepr = mem::transmute(string_ptr);
> +                    let ptr = (*this).data;
> +                    debug_assert!(!ptr.is_null(), "Should be non-null pointer even for zero capacity!");
> +                    ::std::slice::from_raw_parts_mut(ptr, self.capacity)

std::slice is in scope here, just use `slice::from_raw_parts_mut`.

::: servo/support/gecko/nsstring/src/lib.rs:563
(Diff revision 44)
>              pub fn to_mut(&mut self) -> &mut [$char_t] {
>                  unsafe {
>                      let len = self.len();
>                      if len == 0 {
> -                        // Use an arbitrary non-null value as the pointer
> -                        slice::from_raw_parts_mut(0x1 as *mut $char_t, 0)
> +                        // Use an arbitrary but aligned non-null value as the pointer
> +                        slice::from_raw_parts_mut(mem::align_of::<$char_t>() as *mut $char_t, 0)

I'm a touch confused why I did this here actually. IIRC our data pointer can't ever be null, so we should be able to just use it :-/.

Nevertheless, `std::ptr::NonNull::dangling()` would probably be cleaner now. https://doc.rust-lang.org/std/ptr/struct.NonNull.html#method.dangling

::: servo/support/gecko/nsstring/src/lib.rs:598
(Diff revision 44)
> +            /// Unshares the buffer of the string, sets the capacity to the
> +            /// allocation size resulting from rounding up `len`. Set the
> +            /// length of the string to the rounded-up capacity and returns
> +            /// the buffer as a mutable slice.
> +            ///
> +            /// Fails also if the new length doesn't fit in 32 bits.
> +            ///
> +            /// # Safety
> +            ///
> +            /// Unsafe because of exposure of uninitialized memory.

I think this comment should be on the public `bulk_write` not `start_bulk_write_impl`

::: servo/support/gecko/nsstring/src/lib.rs:608
(Diff revision 44)
> +            /// Fails also if the new length doesn't fit in 32 bits.
> +            ///
> +            /// # Safety
> +            ///
> +            /// Unsafe because of exposure of uninitialized memory.
> +            unsafe fn start_bulk_write_impl(&mut self, capacity: usize, units_to_preserve: usize, allow_shrinking: bool) -> Result<usize, ()> {

Feels to me like it would be cleaner to write ths in `bulk_write`, and then in the restart caller, use `mem::forget(self.string.bulk_write(...)?)` to avoid the other case.

::: servo/support/gecko/nsstring/src/lib.rs:722
(Diff revision 44)
>                  if s.is_empty() {
>                      return $Str::new();
>                  }
>                  $Str {
>                      hdr: $StringRepr {
> -                        data: s.as_ptr(),
> +                        data: unsafe { ::mem::transmute(s.as_ptr()) },

This transmute is unnecessary. `as` will do the job fine.

::: testing/web-platform/meta/mimesniff/mime-types/parsing.any.js.ini:8943
(Diff revision 44)
>      expected: FAIL
>  
> -  [/ (Blob/File)]
> +  [ÿ/x (Request/Response)]
>      expected: FAIL
>  
> -  [/ (Request/Response)]
> +  [x/ÿ (Request/Response)]

It'd be awesome if you could switch these test changes out into a different patch so I don't have to scroll through them :-)

::: toolkit/xre/nsWindowsRestart.cpp:36
(Diff revision 44)
> -  int len = strlen(arg);
> +  size_t len = strlen(arg);
>    char16_t *s = new char16_t[(len + 1) * sizeof(char16_t)];
>    if (!s)
>      return nullptr;
>  
> -  ConvertUTF8toUTF16 convert(s);
> +  size_t dstLen = ::MultiByteToWideChar(CP_UTF8, 0, arg, len, reinterpret_cast<wchar_t*>(s), len);

I presume this change was because updater.exe doesn't link to encoding_rs?

Attachment #8943599 - Flags: review?(nika) → review-

Henri Sivonen (:hsivonen)

Assignee

Comment 136

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> It'd be awesome if this could be split up a bit to make it a touch more digestable & less daunting to look over.

I tried that first, but it was too messy to deal with multiple parts.

> I presume here that you've worked around the call overhead pessimization issues?

I haven't, but 1) `encoding_mem_is_ascii` is very fast and 2) Anthony Jones said on dev-platform that cross-language inlining is around the corner and we shouldn't try to optimize cross-language calls.

> I'm curious why you check for latin1 rather than ASCII here. Is the logic for checking if the first cacheline is basic latin faster than checking if it's valid ASCII?

I'm not checking for latin1 but for Basic Latin. Since Rust doesn't allow overloading, I use Basic Latin to mean the ASCII range of UTF-16 when the name with `ascii` is already taken by another function.

> You mention that zero-length strings are special. That being said, I think we could probably get away with truncate()-ing the string in this case just fine, as that just frees whatever backing buffer was in use, and then switches it to point at the singleton static empty string.

The point here is that if `capacity` is zero, the string is already a zero-length string. Otherwise, it would be potentially dangerous to make it a zero-length string, so the code tries hard to make the least dangerous non-empty string, which is a single REPLACEMENT CHARACTER.

> If we switch `Drop` to `truncate()`, we should be able to get away with defining it in the macro :-).

It's specifically trying to avoid `truncate`, and then the code unit representations of the REPLACEMENT CHARACTER differ.

> I'm a touch confused why I did this here actually. IIRC our data pointer can't ever be null, so we should be able to just use it :-/.
> 
> Nevertheless, `std::ptr::NonNull::dangling()` would probably be cleaner now. https://doc.rust-lang.org/std/ptr/struct.NonNull.html#method.dangling

OK. Last week, `dangling()` wasn't yet available for use on try, IIRC.

> I presume this change was because updater.exe doesn't link to encoding_rs?

Yes.

Henri Sivonen (:hsivonen)

Assignee

Comment 137

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> Don't bother switching this to a `*mut`. We can just  perform an `as` cast to get the data as a `*mut` if we need it to be one.
> 
> BTW, now that `std::ptr::NonNull` is stable (since 1.25), we could actually switch this over to a `NonNull<$char_t>` which is covariant and provides some handy `as_ref` `as_mut` and `as_ptr` methods :-).
> 
> It'd also mean we'd get the null variant optimization in `Option<ns[C]String>`, which is kinda cool.

I was worried that casting from `*const` to `*mut` might be UB. Is that not actually a problem?

Henri Sivonen (:hsivonen)

Assignee

Comment 138

•

6 years ago

NI for the safety of casting from `*const` to `*mut`.

Flags: needinfo?(nika)

Henri Sivonen (:hsivonen)

Assignee

Comment 139

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> Please explicitly compare the length to 0 here :-)

Fixed.

> This might be more clear as:
> 
> auto allocLen = CheckedUint32(len) + 1;

Changed.

> nit: newline before this comment.

Fixed.

> Perhaps add a comment here explaining that at most one char16_t will be required per char.
> 
> I think it's pretty obvious but it might be nice to clarify the reasoning.

Fixed.

> nit: newline before this comment.

Fixed.

> I don't think it's necessary. The filenames for xpconnect errors should all be chrome files, which always have ASCII filenames IIRC.
> 
> That being said, it wouldn't be a big deal to use UTF8 instead of ASCII as well *shrug*

Changed to CopyUTF8toUTF16.

> MakeStringSpan should handle null pointers just fine, so I think you can dodge this check.

Fixed.

> Please add a comment explaining where this number came from.

Added.

> Could we just define this as CACHE_LINE - 1?

Yes. Fixed.

> I'm pretty sure that caching .len() is completely unnecessary here. The function call will always be inlined & is a very simple cast/read.
> 
> Perhaps just use 'buffer.len()' in the places where you mention 'len'?

Changed. (Going to be overwritten by the follow-up anyway.)

> Could you have a comment somewhere explaining what a "conversion" function should look like?

Added comment.

> nit: Could we make this macro also include the names of the parameters so it's easier to see what's being passed, e.g:
> 
> 
>     (name = $name:ident,
>      convert = $convert:ident,
>      other_ty = $other_ty:ty,
>      math = $math:ident) => { ... }
>      
> It would make the callsites more clear :-)

Added.

> nit: please add a `use std::slice` and just write `slice::from_raw_parts` here :-)

Changed.

> You can leave off the { } and just write:
> 
>     pub struct BulkWriteOk;

Fixed.

> You can cast a `*const $char_t` to a `*mut $char_t` with  `as`, for example:
> 
>     data: &NUL as *const _ as *mut _,

Fixed.

> I think just calling `self.string.truncate()` will be more clear here.

Fixed.

> std::slice is in scope here, just use `slice::from_raw_parts_mut`.

Fixed.

> OK. Last week, `dangling()` wasn't yet available for use on try, IIRC.

Used `dangling()`.

> I think this comment should be on the public `bulk_write` not `start_bulk_write_impl`

Moved and rewritten.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/44-45/

Henri Sivonen (:hsivonen)

Assignee

Comment 141

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> Would it make sense for us to accept both signed and unsigned char spans in the AppendUTF8toUTF16 method? I don't think it should make a difference to us, and this reinterpret cast is pretty ugly :-/

Unfortunately, the `char` variants in C and C++ are a disaster and they are simultaneously too different and too alike. I get compiler warnings if I try to introduce overloads that differ based on the signedness of the type parameter of the span. :-( Let's just live with this particular ugly reinterpret_cast.

Nathan Froyd [:froydnj]

Comment 142

•

6 years ago

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

I am not going to be able to look at this for a while; Eric can check out all the string logic, since that's something up his alley anyway.

Attachment #8943599 - Flags: review?(nfroyd) → review?(erahm)

Henri Sivonen (:hsivonen)

Assignee

Comment 143

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #138)
> NI for the safety of casting from `*const` to `*mut`.

Nevermind. NonNull addresses this anyway.

Flags: needinfo?(nika)

Henri Sivonen (:hsivonen)

Assignee

Comment 144

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> I was worried that casting from `*const` to `*mut` might be UB. Is that not actually a problem?

Switched to `NonNull`.

> let this = `self.string as *mut _ as *mut  $StringRepr` will work here and avoid the `transmute`.
> 
> This could also probably be moved into a private method on `nsA[C]String` called something along the lines of `fn as_repr(&self) -> NotNull<$StringRepr>`.
> 
> We could use it in `impl Deref for $AString` as well.

> This could also probably be moved into a private method on nsA[C]String called something along the lines of fn as_repr(&self) -> NotNull<$StringRepr>.

Done.

> We could use it in impl Deref for $AString as well.

Not done due to mutability mismatch.

> Again, could be greatly simplified by a shared helper method on `$AString`, and the transmute is unnecessary.

Fixed.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/45-46/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/46-47/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/47-48/

Henri Sivonen (:hsivonen)

Assignee

Comment 148

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> I haven't, but 1) `encoding_mem_is_ascii` is very fast and 2) Anthony Jones said on dev-platform that cross-language inlining is around the corner and we shouldn't try to optimize cross-language calls.

Added C++-side handling of the easy case for length < 16. These are easy to remove when cross-language inlining arrives for real.

> Feels to me like it would be cleaner to write ths in `bulk_write`, and then in the restart caller, use `mem::forget(self.string.bulk_write(...)?)` to avoid the other case.

Isn't that a bit too clever, i.e. harder to follow, than having a separate "impl" commonality method?

> This transmute is unnecessary. `as` will do the job fine.

Fixed.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/255258/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/255258/

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/255260/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/255260/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/48-49/

Henri Sivonen (:hsivonen)

Assignee

Comment 152

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> It would be really nice if all of these API usage changes could be moved into a separate patch. It would make it much easier to review :-).

Done.

> nit: line length here and everywhere :-)

Manually reformatted this. (`cargo fmt` would have changed too much in this file.) Ran `cargo fmt` on `conversions.rs` and ran `mach clang-format` on the C++ changes.

> It'd be awesome if you could switch these test changes out into a different patch so I don't have to scroll through them :-)

Done.

Henri Sivonen (:hsivonen)

Assignee

Comment 153

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review220962

> > This could also probably be moved into a private method on nsA[C]String called something along the lines of fn as_repr(&self) -> NotNull<$StringRepr>.
> 
> Done.
> 
> > We could use it in impl Deref for $AString as well.
> 
> Not done due to mutability mismatch.

Aside: It's unclear to me what value we get from having distinct `nsAString` and `nsStringRepr` as opposed to making them the same type. I would hope Rust had some way to mark a type as neither copyable nor movable to avoid torn structs. Does it not?

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Attachment #8990224 - Flags: review?(nika)

Attachment #8990225 - Flags: review?(nika)

Attachment #8943599 - Flags: review?(nika)

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Attachment #8943599 - Flags: review?(nika)

Attachment #8943599 - Flags: review?(erahm)

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/49-50/

Henri Sivonen (:hsivonen)

Assignee

Comment 155

•

6 years ago

NI erahm, because Bugzilla keeps dropping the redirected review request.

Flags: needinfo?(erahm)

Comment hidden (mozreview-request)

Comment on attachment 8990224 [details]
Bug 1402247 - Big scroll-past parts.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/255258/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8990225 [details]
Bug 1402247 - Simple caller updates.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/255260/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/50-51/

Henri Sivonen (:hsivonen)

Assignee

Comment 159

•

6 years ago

Try run for perf numbers with current patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b2c7b881ba5d988666cee04736cc4e4f8006baad

Try run for perf numbers with patch from bug 1473337:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a847ed8c520698a38cf34646c9bbdf5ebd8942f

Nika Layzell [:nika] (ni? for response)

Comment 160

•

6 years ago

mozreview-review

Comment on attachment 8990225 [details]
Bug 1402247 - Simple caller updates.

https://reviewboard.mozilla.org/r/255260/#review262296

Attachment #8990225 - Flags: review?(nika) → review+

Nika Layzell [:nika] (ni? for response)

Comment 161

•

6 years ago

mozreview-review

Comment on attachment 8990224 [details]
Bug 1402247 - Big scroll-past parts.

https://reviewboard.mozilla.org/r/255258/#review262298

Attachment #8990224 - Flags: review?(nika) → review+

Eric Rahm [:erahm]

Comment 162

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) (away until 2018-08-06) from comment #155)
> NI erahm, because Bugzilla keeps dropping the redirected review request.

I can get to this next week, leaving ni so I remember.

Nika Layzell [:nika] (ni? for response)

Comment 163

•

6 years ago

mozreview-review

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review264118

::: js/xpconnect/src/XPCConvert.cpp:280
(Diff revision 51)
> -        if (utf8String->IsEmpty()) {
> +        uint32_t len = utf8String->Length();
> +
> +        if (len == 0) {

I dont really understand this change?

why not just use IsEmpty?

::: js/xpconnect/src/XPCConvert.cpp:303
(Diff revision 51)
> +            if (!buffer) {
> -            return false;
> +                return false;
> +            }
> +            size_t written =
> +                LossyConvertUTF8toLatin1(*utf8String, MakeSpan(buffer, len));
> +             buffer[written] = 0;

nit: indent

::: servo/support/gecko/nsstring/src/lib.rs:420
(Diff revision 51)
> +                                                      allow_shrinking)?;
> +                Ok(())
> +            }
> +
> +            pub fn finish(mut self, length: usize, allow_shrinking: bool) -> BulkWriteOk {
> +                // NOTE: Drop is implemented outside the macro earlier in this file

perhaps mention why it is implemented separately here?

::: servo/support/gecko/nsstring/src/lib.rs:579
(Diff revision 51)
>              pub fn fallible_to_mut(&mut self) -> Result<&mut [$char_t], ()> {
>                  unsafe {
>                      let len = self.len();
>                      if len == 0 {
> -                        // Use an arbitrary non-null value as the pointer
> -                        Ok(slice::from_raw_parts_mut(0x1 as *mut $char_t, 0))
> +                        // Use an arbitrary but aligned non-null value as the pointer
> +                        Ok(slice::from_raw_parts_mut(ptr::NonNull::<$char_t>::dangling().as_ptr() as *mut $char_t, 0))

can we wrap this line :-)?

::: xpcom/string/nsReadableUtils.cpp:33
(Diff revision 51)
>  template <class FromStringT, class ToCharT>
>  inline
>  ToCharT*
>  AllocateStringCopy(const FromStringT& aSource, ToCharT*)
>  {
> -  return static_cast<ToCharT*>(moz_xmalloc(
> +  // XXX can this overflow?

i believe that the string is required to fit in a fairly limited address space (https://searchfox.org/mozilla-central/source/xpcom/string/nsTSubstring.cpp#17-21)

::: xpcom/string/nsReadableUtils.cpp:42
(Diff revision 51)
>  
>  
>  char*
>  ToNewCString(const nsAString& aSource)
>  {
> -  char* result = AllocateStringCopy(aSource, (char*)0);
> +  char* dest = AllocateStringCopy(aSource, (char*)nullptr);

that cast should be unnecessary

::: xpcom/string/nsTSubstring.h
(Diff revision 51)
>  
>    /**
>     * buffer access
>     */
>  
> -

nit random whitespace change

::: xpcom/string/nsUTF8Utils.h:49
(Diff revision 51)
>    }
>    static bool is4byte(char aChar)
>    {
>      return (aChar & 0xF8) == 0xF0;
>    }
> +

nit: random whitespace change

Attachment #8943599 - Flags: review?(nika) → review+

Henri Sivonen (:hsivonen)

Assignee

Comment 164

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review264118

> I dont really understand this change?
> 
> why not just use IsEmpty?

Since the length needed to be read anyway, I thought I'd just inspect that result, but changed to `IsEmpty()`.

> nit: indent

Fixed.

> perhaps mention why it is implemented separately here?

Mentioned.

> can we wrap this line :-)?

Wrapped.

> i believe that the string is required to fit in a fairly limited address space (https://searchfox.org/mozilla-central/source/xpcom/string/nsTSubstring.cpp#17-21)

Revised comment accordingly.

> that cast should be unnecessary

Fixed. (Both instances.)

> nit random whitespace change

Restored.

Henri Sivonen (:hsivonen)

Assignee

Comment 165

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review264118

> Fixed. (Both instances.)

The cast is necessary for template dispatch.

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/51-52/

Comment hidden (mozreview-request)

Comment on attachment 8990224 [details]
Bug 1402247 - Big scroll-past parts.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/255258/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8990225 [details]
Bug 1402247 - Simple caller updates.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/255260/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/213974/diff/52-53/

Henri Sivonen (:hsivonen)

Assignee

Comment 170

•

6 years ago

Thank you for the r+.

Try run after rebasing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a68b7633b85b194eb87dadab0cc9b34b89c4be72

Henri Sivonen (:hsivonen)

Assignee

Comment 171

•

6 years ago

The failure to make use of the slop in the initial ASCII-sized allocation when converting *to* UTF-8 is fixed in bug 1482093.

Eric Rahm [:erahm]

Comment 172

•

6 years ago

mozreview-review

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review269118

Phew! That was one lengthy review :) I *think* everything looks pretty good. I really appreciate the thorough comments and expanded tests. I skimmed the rust parts and am deferring to Nika on that. I have some relatively minor comments on the C++ side that I'd like to see addressed before landing.

One part that I'm still a little sketchy on is the change to MutatePrep and SetCapacity, those are a bit hard to validate via code review. For example it would have been nice to split out parts that were just perf tweaks to a separate bug / patch (the realloc stuff, the shrinking optimizations). Given the amount of time this has waited to land I think it's okay to leave those as-is for now.

::: js/xpconnect/src/XPCConvert.cpp:341
(Diff revision 53)
> +
> +        // For its internal simplicity, ConvertUTF8toUTF16 requires the
> +        // destination to be one code unit longer than the source, but
> +        // it never actually writes more code units than the number of
> +        // code units in the source. That's why it's OK to claim the
> +        // output buffer has len + 1 space but then still expect to

`allocLen` already has the `+ 1` builtin right? It might be clearer to use a separate var for `allocSize` and just use `allocLen` here.

::: js/xpconnect/src/XPCConvert.cpp:351
(Diff revision 53)
> +        buffer[written] = 0;
> +
> +        // JS_NewUCStringDontDeflate takes ownership on success, i.e. a
>          // successful call will make it the responsiblity of the JS VM
>          // to free the buffer.
> -        JSString* str = JS_NewUCString(cx, buffer, len);
> +        // written can never exceed len + 1, so the truncation is OK.

What truncation? The one above?

::: servo/support/gecko/nsstring/src/conversions.rs:50
(Diff revision 53)
> +#[inline(always)]
> +fn starts_with_ascii(buffer: &[u8]) -> bool {
> +    // We examine data only up to the end of the cache line
> +    // to make this check minimally disruptive.
> +    let bound = if buffer.len() <= CACHE_LINE {
> +        buffer.len()

This seems to assume the buffer is allocated on a cache line? For example if buffer.len() == 64 and buffer.as_ptr() = 0x001233F we'd blow the cache line. I don't think our allocators would allow that specific case, but for in-line buffers it could be possible.

::: servo/support/gecko/nsstring/src/conversions.rs:65
(Diff revision 53)
> +    // to look at more than one cache line in the UTF-16 case, because looking
> +    // at just one cache line wouldn't catch non-ASCII Latin with high enough
> +    // probability with Latin-script languages that have relatively infrequent
> +    // non-ASCII characters.
> +    let bound = if buffer.len() <= CACHE_LINE {
> +        buffer.len()

This has the same issue as above.

At first glance this looks wrong: we're okay with up to 2 cache lines but comparing against one. This works b/c `sizeof(u16) == 2` and `len` is the number of elements not byte size. Is there a rusty way to compare the byte size of the array to be a little more explicit?

::: xpcom/string/nsReadableUtils.h:21
(Diff revision 53)
>  #include "mozilla/Assertions.h"
>  #include "nsAString.h"
>  
>  #include "nsTArrayForwardDeclare.h"
>  
>  // Can't include mozilla/Encoding.h here

It would be helpful to add a note here that these are bindings for `encoding_rs` (or whatever wrapper lib) and a pointer to the docs/source.

::: xpcom/string/nsReadableUtils.h:119
(Diff revision 53)
> + * architecture and must not be relied upon.
> + *
> + * The length of aDest must be not be less than the length of aSource.
> + */
> +inline void
> +LossyConvertUTF16toLatin1(mozilla::Span<const char16_t> aSource,

Just curious on the span usage here: is there a reason we pass-by-value rather than const ref? Is this just the canonical way to use spans?

::: xpcom/string/nsReadableUtils.h:365
(Diff revision 53)
>  
>  /**
>   * Returns a new |char| buffer containing a zero-terminated copy of |aSource|.
>   *
>   * Allocates and returns a new |char| buffer which you must free with |free|.
> - * Performs a lossy encoding conversion by chopping 16-bit wide characters down to 8-bits wide while copying |aSource| to your new buffer.
> + * Performs a conversion with LossyConvertUTF16toLatin1() writing into the

Okay, so this changes the behavior to assert in debug builds. I'm not sure if we want that or not, perhaps it was discussed previously in the reviews?

::: xpcom/string/nsReadableUtils.h:571
(Diff revision 53)
> +  if (length < 16) {
> +    for (size_t i = 0; i < length; i++) {
> +      if (ptr[i] >= 0x80U) {
> +        ptr += i;
> +        length -= i;
> +        goto end;

I think this warrants a comment. Is the deal that `encoding_rs` knows how to check if it's Latin1 and can handle chars > 0x7F?

::: xpcom/string/nsReadableUtils.cpp
(Diff revision 53)
> -{
> -  if (aSource) {
> -    AppendUTF8toUTF16(nsDependentCString(aSource), aDest);
> -  }
> -}
> -

\o/

::: xpcom/string/nsReadableUtils.cpp:69
(Diff revision 53)
> +  destLen += 2;
> +  if (!destLen.isValid()) {
> +    return nullptr;
>    }
> -
> -  char* result = static_cast<char*>
> +  size_t destLenVal = destLen.value();
> +  if (destLenVal > UINT32_MAX) {

Why do we need this check? If it's necessary you could use `CheckedInt<uint32_t> destLen` instead and it'd cover this.

::: xpcom/string/nsReadableUtils.cpp:664
(Diff revision 53)
> -
> -    return 1;
> +      return 1;
> -  }
> +    }
> -
> -  if (u16 != u16end) {
> -    // We get to the end of the UTF8 string, but no to the end of
> +    // No need for ASCII optimization, since both NextChar()
> +    // calls get inlined.
> +    uint32_t scalar8 = UTF8CharEnumerator::NextChar(&u8, u8end, aErr);

Is `aErr` only set if there's an error? Or does it get cleared on success? If the latter then we need to take that into account.

::: xpcom/string/nsTSubstring.h:908
(Diff revision 53)
> +public:
>    /**
> -   * this function prepares mData to be mutated.
> +   * Prepares mData to be mutated such that the capacity of the string
> +   * (not counting the zero-terminator) is at least aCapacity.
> +   * Returns the actual capacity, which may be larger than what was
> +   * requested or UINT32_MAX on allocation failure.

I'm not a huge fan of using UINT32_MAX for an error code, do you think using a `Result<...>` would be more clear? If there's some reason we can't (maybe rust interop), can we at least use a constant instead?

::: xpcom/string/nsTSubstring.cpp:146
(Diff revision 53)
>      }
>  
> -    MOZ_ASSERT(XPCOM_MIN(temp, kMaxCapacity) >= aCapacity,
> +    newCapacity = XPCOM_MIN(temp, kMaxCapacity);
> +    MOZ_ASSERT(newCapacity >= aCapacity,
>                 "should have hit the early return at the top");
> -    aCapacity = XPCOM_MIN(temp, kMaxCapacity);
> +    // Avoid shinking if new buffer within 300 of the old. Note that

nit: 'shrinking'

Why did you choose 300? Please add a comment and make this a constant.

::: xpcom/string/nsTSubstring.cpp:147
(Diff revision 53)
>  
> -    MOZ_ASSERT(XPCOM_MIN(temp, kMaxCapacity) >= aCapacity,
> +    newCapacity = XPCOM_MIN(temp, kMaxCapacity);
> +    MOZ_ASSERT(newCapacity >= aCapacity,
>                 "should have hit the early return at the top");
> -    aCapacity = XPCOM_MIN(temp, kMaxCapacity);
> -  }
> +    // Avoid shinking if new buffer within 300 of the old. Note that
> +    // signed underflow is defined behavior.

Do you mean 'unsigned underflow'?

::: xpcom/string/nsTSubstring.cpp:155
(Diff revision 53)
> -  // several cases:
> -  //
> -  //  (1) we have a refcounted shareable buffer (this->mDataFlags &
> -  //      DataFlags::REFCOUNTED)
> -  //  (2) we have an owned buffer (this->mDataFlags & DataFlags::OWNED)
> -  //  (3) we have an inline buffer (this->mDataFlags & DataFlags::INLINE)
> +      MOZ_ASSERT(aAllowShrinking, "How come we didn't return earlier?");
> +      // We're already close enough to the right size.
> +      newData = oldData;
> +    } else {
> +      size_type storageSize = (newCapacity + 1) * sizeof(char_type);
> +      // Since we allocate only if we need a different jemalloc bucket

Maybe rephrase this, "Since we allocate only by powers of 2 we always fit into a full mozjemalloc bucket" or something like that. It wasn't initially clear how we were enforcing the sizes.

::: xpcom/string/nsTSubstring.cpp:773
(Diff revision 53)
>    }
> -
> -  // always null-terminate here, even if the buffer got longer.  this is
> -  // for backwards compat with the old string implementation.
> +  if (capacity) {
> +    // In the zero case StartBulkWrite already put the string
> +    // in a valid state.
> +
> +    // Otherwise, instead of calling FinishBulkWrite,

What's supposed to call `FinishBulkWrite` then?

::: xpcom/string/nsUTF8Utils.h:85
(Diff revision 53)
> - * Extract the next UCS-4 character from the buffer and return it.  The
> - * pointer passed in is advanced to the start of the next character in the
> - * buffer.  If non-null, the err parameter is filled in if an error occurs.
>   *
> - * If an error occurs that causes UCS2_REPLACEMENT_CHAR to be returned, then
> - * the buffer will be updated to move only a single UCS-2 character.
> + * Note: This method never sets *aErr to false to allow error accumulation
> + * across multiple calls.

I think this answers my previous question re: error accumulation.

::: xpcom/tests/gtest/TestStrings.cpp:1297
(Diff revision 53)
> +  s.SetCapacity(100);
> +  const char16_t* ptr = s.BeginReading();
> +  EXPECT_NE(origPtr, ptr);
> +  for (int i = 0; i < 100; i++) {
> +    s.Append(u'a');
> +    EXPECT_EQ(s.BeginReading(), ptr);

Might as well assert the length as well

::: xpcom/tests/gtest/TestStrings.cpp:1301
(Diff revision 53)
> +    s.Append(u'a');
> +    EXPECT_EQ(s.BeginReading(), ptr);
> +  }
> +}
> +
> +TEST_F(Strings, append_string_with_capacity)

Can you add a test for the old-style capacity behavior you were attempting to preserve? I think it's something like the following right?

```
nsString foo;
foo.SetCapacity(100);
memcpy(foo.BeginWriting(), buffer, bufferSize);
foo.SetLength(bufferSize);
```

Attachment #8943599 - Flags: review-

Eric Rahm [:erahm]

Comment 173

•

6 years ago

mozreview-review

Comment on attachment 8990225 [details]
Bug 1402247 - Simple caller updates.

https://reviewboard.mozilla.org/r/255260/#review269176

Just a note: I'm fine with Nika's r+ on this.

Eric Rahm [:erahm]

Comment 174

•

6 years ago

mozreview-review

Comment on attachment 8990224 [details]
Bug 1402247 - Big scroll-past parts.

https://reviewboard.mozilla.org/r/255258/#review269178

I'm fine with Nika's r+ on this, but can you give it a better title? Or are you planning on squashing before landing?

Eric Rahm [:erahm]

Updated

•

6 years ago

Flags: needinfo?(erahm)

Henri Sivonen (:hsivonen)

Assignee

Comment 175

•

6 years ago

mozreview-review-reply

Comment on attachment 8943599 [details]
Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions.

https://reviewboard.mozilla.org/r/213974/#review269118

Thank you! Let's see what the try run about asserting on in debug builds on the input of lossy conversions says.

> What truncation? The one above?

Removed the comment line about truncation. I thought the line below the comment truncated `size_t` to `uint32_t`, but either I have been imagining things or the JS API changed to `size_t` during the development of this patch.

> This seems to assume the buffer is allocated on a cache line? For example if buffer.len() == 64 and buffer.as_ptr() = 0x001233F we'd blow the cache line. I don't think our allocators would allow that specific case, but for in-line buffers it could be possible.

It's not meant to assume that the buffer is cache line-aligned. (The buffer part of `nsStringBuffer` isn't.) It's just meant to avoid the masking stuff for short strings. This becomes better in the already-reviewed bug 1473337, which I didn't fold into this one in order to stop the churn in this bug.

> This has the same issue as above.
> 
> At first glance this looks wrong: we're okay with up to 2 cache lines but comparing against one. This works b/c `sizeof(u16) == 2` and `len` is the number of elements not byte size. Is there a rusty way to compare the byte size of the array to be a little more explicit?

I believe there is no safe-Rust way to compare to byte size, so the explicitness is in the comment only.

> It would be helpful to add a note here that these are bindings for `encoding_rs` (or whatever wrapper lib) and a pointer to the docs/source.

Added such note.

> Just curious on the span usage here: is there a reason we pass-by-value rather than const ref? Is this just the canonical way to use spans?

This is the canonical way to use Span. I believe, but am not 100% sure, that Microsoft's design rationale was ensuring that a span argument is represented the same way as a pointer argument followed by a length argument in the ABI. (I haven't verified this representation, either.)

> Okay, so this changes the behavior to assert in debug builds. I'm not sure if we want that or not, perhaps it was discussed previously in the reviews?

It wasn't discussed in previous review. Also, the comment isn't true until encoding_rs 0.8.5. (Oops!) Here's a try run with encoding_rs 0.8.5 to check if it's feasible: 
https://treeherder.mozilla.org/#/jobs?repo=try&revision=27f0b372c7455b072fd2eb86c62b5233f3a2e054

If it is feasible, I think it makes sense to assert in debug builds, because we should get rid of calls that result in garbage output.

> I think this warrants a comment. Is the deal that `encoding_rs` knows how to check if it's Latin1 and can handle chars > 0x7F?

Yes, the idea is to deal only with ASCII in the inline C++ part and the Rust code knows how to check non-ASCII. Added comment.

> Why do we need this check? If it's necessary you could use `CheckedInt<uint32_t> destLen` instead and it'd cover this.

Otherwise `*aUTF8Count` could overflow. Changed to `CheckedInt<uint32_t>`.

> Is `aErr` only set if there's an error? Or does it get cleared on success? If the latter then we need to take that into account.

Not cleared on success. This is documented in `nsUTF8Utils.h`.

> I'm not a huge fan of using UINT32_MAX for an error code, do you think using a `Result<...>` would be more clear? If there's some reason we can't (maybe rust interop), can we at least use a constant instead?

Used `Result` in C++ but still `UINT32_MAX` as a magic value when crossing the FFI.

> nit: 'shrinking'
> 
> Why did you choose 300? Please add a comment and make this a constant.

Changed to 384 as the exact midpoint between 256 and 512 and explained why a number between 256 and 512 is used.

> Do you mean 'unsigned underflow'?

Yes. Fixed the comment.

> Maybe rephrase this, "Since we allocate only by powers of 2 we always fit into a full mozjemalloc bucket" or something like that. It wasn't initially clear how we were enforcing the sizes.

Used the suggested text.

> What's supposed to call `FinishBulkWrite` then?

Nothing, because the code below the comment implements a legacy semantic-preserving alternative to `FinishBulkWrite`.

> Might as well assert the length as well

Added assertion. (Also in the next test.)

> Can you add a test for the old-style capacity behavior you were attempting to preserve? I think it's something like the following right?
> 
> ```
> nsString foo;
> foo.SetCapacity(100);
> memcpy(foo.BeginWriting(), buffer, bufferSize);
> foo.SetLength(bufferSize);
> ```

Added.

Henri Sivonen (:hsivonen)

Assignee

Comment 176

•

6 years ago

mozreview-review-reply

Comment on attachment 8990224 [details]
Bug 1402247 - Big scroll-past parts.

https://reviewboard.mozilla.org/r/255258/#review269178

The plan is to squash, yes. Otherwise, I'd leave non-compiling intermediate steps in the tree, which would be bad for future bisecting.

Henri Sivonen (:hsivonen)

Assignee

Comment 177

•

6 years ago

Attached patch erahm's comments addressed (except for debug asserting Latin1ness) (obsolete) — Details — Splinter Review

Pushing to MozReview fails even though I thought it was supposed to work for another week. Here's the main patch with erahm's comments addressed (except for the debug assert on Latin1ness).

Henri Sivonen (:hsivonen)

Assignee

Comment 178

•

6 years ago

Try with an attempt to replace lossy conversions to Latin1 with out-of-range input with conversions to UTF-8:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5bf11a4d1922dd25e15391f0551f926b21ff6d9

Henri Sivonen (:hsivonen)

Assignee

Comment 179

•

6 years ago

Attached patch Patch with erahm's comments addressed (obsolete) — Details — Splinter Review

Attachment #8999144 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 180

•

6 years ago

Attached patch Interdiff between the last MozReview patch and the Splinter patch (obsolete) — Details — Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Comment 181

•

6 years ago

Remove one assertion still:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eba9a052082b12c3561425969fbf64d594838d8a

Henri Sivonen (:hsivonen)

Assignee

Comment 182

•

6 years ago

Attached patch Patch with erahm's comments addressed, with less SRI brokenness (obsolete) — Details — Splinter Review

Attachment #8999174 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 183

•

6 years ago

Attached patch Interdiff between the last MozReview patch and the Splinter patch (v2) (obsolete) — Details — Splinter Review

Attachment #8999175 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 184

•

6 years ago

Comment on attachment 8999204 [details] [diff] [review]
Patch with erahm's comments addressed, with less SRI brokenness

Probably makes more sense to look at the interdiff instead of this attachment...

The changes required to run our test suite with debug assert inside encoding_rs asserting that conversions to Latin1 have the input in range turned out to be minor. Note how the SRI stuff already using UTF-8 to UTF-16 conversion when it converts back to UTF-16 (not included in the diff context).

Attachment #8999204 - Flags: review?(erahm)

Eric Rahm [:erahm]

Comment 185

•

6 years ago

Comment on attachment 8999204 [details] [diff] [review]
Patch with erahm's comments addressed, with less SRI brokenness

Review of attachment 8999204 [details] [diff] [review]:
-----------------------------------------------------------------

Nathan, I'd like a second (or third) set of eyes on the final changes to MutatePrep, ReplacePrep, and SetCapacity in nsTString.(h|cpp). We could maybe punt to dbaron if you prefer.

Attachment #8999204 - Flags: review?(nfroyd)

Eric Rahm [:erahm]

Comment 186

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #184)
> Comment on attachment 8999204 [details] [diff] [review]
> Patch with erahm's comments addressed, with less SRI brokenness
> 
> Probably makes more sense to look at the interdiff instead of this
> attachment...
> 
> The changes required to run our test suite with debug assert inside
> encoding_rs asserting that conversions to Latin1 have the input in range
> turned out to be minor. Note how the SRI stuff already using UTF-8 to UTF-16
> conversion when it converts back to UTF-16 (not included in the diff
> context).

Judging by your recent changes we ran into some assertions, my concern is that we'll hit a bunch more outside of automation and make debug builds basically unusable. I'd prefer if we maintained the behavior that lossy conversion doesn't assert and then in a follow up just get rid of all uses of it that we can. Otherwise I think we need to hold off on landing this until after the next release is cut so that we have a full cycle to filter through crashes and don't have to worry about beta uplifts.

Henri, what do you think? How adamant are you about asserting in debug builds?

Flags: needinfo?(hsivonen)

Henri Sivonen (:hsivonen)

Assignee

Comment 187

•

6 years ago

(In reply to Eric Rahm [:erahm] from comment #186)
> (In reply to Henri Sivonen (:hsivonen) from comment #184)
> > Comment on attachment 8999204 [details] [diff] [review]
> > Patch with erahm's comments addressed, with less SRI brokenness
> > 
> > Probably makes more sense to look at the interdiff instead of this
> > attachment...
> > 
> > The changes required to run our test suite with debug assert inside
> > encoding_rs asserting that conversions to Latin1 have the input in range
> > turned out to be minor. Note how the SRI stuff already using UTF-8 to UTF-16
> > conversion when it converts back to UTF-16 (not included in the diff
> > context).
> 
> Judging by your recent changes we ran into some assertions, my concern is
> that we'll hit a bunch more outside of automation and make debug builds
> basically unusable.

The remaining probable cases seem to be in optional logging code. It doesn't seem too horrible to fix those as someone actually turns logging on, but in order to get this landed, I'm willing to leave the assertions out for now and proactively fix the latent logging code.

> I'd prefer if we maintained the behavior that lossy
> conversion doesn't assert and then in a follow up just get rid of all uses
> of it that we can. Otherwise I think we need to hold off on landing this
> until after the next release is cut so that we have a full cycle to filter
> through crashes and don't have to worry about beta uplifts.

Note that the assertion isn't in this patch but in bug 1482095. I'd rather land this (adjusting the comment that claims there are assertions) and release encoding_rs 0.8.6 without the assertions (to unblock bug  bug 1482093) than delay landing this patch. Can this get r+ on that condition?

> Henri, what do you think? How adamant are you about asserting in debug
> builds?

I don't want to maintain two assert behaviors in encoding_rs and I feel pretty strongly that we should (if not right now, pretty soon) treat attempts to pass out-of-range input to conversions to Latin1 as bugs--i.e. assert in debug builds.

Flags: needinfo?(hsivonen)

Eric Rahm [:erahm]

Comment 188

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #187)

> Note that the assertion isn't in this patch but in bug 1482095. I'd rather
> land this (adjusting the comment that claims there are assertions) and
> release encoding_rs 0.8.6 without the assertions (to unblock bug  bug
> 1482093) than delay landing this patch. Can this get r+ on that condition?

That sounds reasonable.

Eric Rahm [:erahm]

Comment 189

•

6 years ago

Comment on attachment 8999205 [details] [diff] [review]
Interdiff between the last MozReview patch and the Splinter patch (v2)

Review of attachment 8999205 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks good (pending the assertion changes we discussed).

::: b/xpcom/string/nsTSubstring.cpp
@@ +27,5 @@
> +// and 512. Since the capacities have various overheads, we
> +// can't compare with 256 or 512 exactly but it's easier to
> +// compare to some number that's between the two, so it's
> +// far away from either to ignore the overheads.
> +#define NSSTRING_BUFFER_SHRINKING_THRESHOLD 384

nit: please make this a kConstant, rather than a #define.

Thanks for the thorough comment.

@@ +261,2 @@
>      aNewLen, aCutStart, true, suffixLength, oldSuffixStart, newSuffixStart);
> +  if (r.isErr()) {

This is much clearer, thanks for updating.

Eric Rahm [:erahm]

Comment 190

•

6 years ago

Comment on attachment 8999204 [details] [diff] [review]
Patch with erahm's comments addressed, with less SRI brokenness

Review of attachment 8999204 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the updates, commented on the interdiff. r=me w/ a minor update.

Attachment #8999204 - Flags: review?(erahm) → review+

Henri Sivonen (:hsivonen)

Assignee

Comment 191

•

6 years ago

Attached patch Patch with even more erahm's comments addressed (obsolete) — Details — Splinter Review

Attachment #8999204 - Attachment is obsolete: true

Attachment #8999204 - Flags: review?(nfroyd)

Attachment #8999510 - Flags: review?(nfroyd)

Henri Sivonen (:hsivonen)

Assignee

Comment 192

•

6 years ago

Attached patch Interdiff between the last MozReview patch and the Splinter patch (v3) (obsolete) — Details — Splinter Review

Attachment #8999205 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 193

•

6 years ago

(In reply to Eric Rahm [:erahm] from comment #189)
> Comment on attachment 8999205 [details] [diff] [review]
> Interdiff between the last MozReview patch and the Splinter patch (v2)
> 
> Review of attachment 8999205 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks, this looks good (pending the assertion changes we discussed).

Adjusted comments for the UTF-16 to Latin1 case. (UTF-8 to Latin1 is new, so it might as well assert from day 1.)

> nit: please make this a kConstant, rather than a #define.

Done.

Henri Sivonen (:hsivonen)

Assignee

Comment 194

•

6 years ago

Attached patch Patch with erahm's comments addressed and aAllowShrinking=false in ReplacePrepInternal — Details — Splinter Review

One change relative to the previous patch: setting aAllowShrinking to false when calling StartBulkWrite from ReplacePrepInternal, because otherwise SetCapacity() before a series of Append() would be useless.

Sorry about having the wrong boolean there earlier.

Attachment #8999510 - Attachment is obsolete: true

Attachment #8999510 - Flags: review?(nfroyd)

Attachment #8999547 - Flags: review?(nfroyd)

Henri Sivonen (:hsivonen)

Assignee

Comment 195

•

6 years ago

Attached patch Interdiff between the last MozReview patch and the Splinter patch (v4) — Details — Splinter Review

Attachment #8999511 - Attachment is obsolete: true

Nathan Froyd [:froydnj]

Comment 196

•

6 years ago

Comment on attachment 8999547 [details] [diff] [review]
Patch with erahm's comments addressed and aAllowShrinking=false in ReplacePrepInternal

Review of attachment 8999547 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with changes based on the below comments and any ensuing discussion you would like to have.

::: servo/support/gecko/nsstring/src/lib.rs
@@ +1224,5 @@
> +    fn Gecko_StartBulkWriteCString(
> +        this: *mut nsACString,
> +        capacity: u32,
> +        units_to_preserve: u32,
> +        allow_shrinking: bool,

IIUC, this function doesn't actually accept an allow_shrinking parameter, so this should be removed.

@@ +1243,5 @@
> +    fn Gecko_StartBulkWriteString(
> +        this: *mut nsAString,
> +        capacity: u32,
> +        units_to_preserve: u32,
> +        allow_shrinking: bool,

Likewise here.

::: xpcom/string/nsTSubstring.cpp
@@ +87,5 @@
> +  // string.
> +  if (MOZ_UNLIKELY(!aCapacity)) {
> +    ::ReleaseData(this->mData, this->mDataFlags);
> +    SetToEmptyBuffer();
> +    this->mDataFlags &= ~DataFlags::VOIDED; // mutation clears voided flag

Why are we doing this when SetToEmptyBuffer only sets TERMINATED here?  Were there changes to SetToEmptyBuffer someplace else?

@@ +163,2 @@
>                 "should have hit the early return at the top");
> +    // Avoid shinking if new buffer size is close to the old. Note that

Nit: "shrinking" and "the new buffer size"

@@ +789,5 @@
> +    // in a valid state.
> +
> +    // Otherwise, instead of calling FinishBulkWrite,
> +    // intentionally leave the string in the weird state
> +    // required by the legacy semantics of this method.

Can we spell out the legacy semantics somewhere, so people don't have to reverse-engineer them from this comment and/or the discussion in this bug?  Or is the intention here that the comment about "existing callers", above, spells out those semantics?  If so, can we make that connection explicit?

::: xpcom/string/nsTSubstring.h
@@ +901,4 @@
>     */
>    void NS_FASTCALL Finalize();
>  
> +public:

The public-ness of this method seems really unfortunate for what is a fairly unwieldy API.  Is it possible to keep this `protected` and instead judiciously use `friend` to let the Rust bindings see these functions?  Or, as a compromise position, should the full API of StartBulkWrite be protected (or even private), and a simpler (aCapacity, aPrefixToPreserve) API be publically exposed for everybody else, since that's what non-nsTSubstring users seem to want?  (I think I like the `friend` approach more, but am trying to think through options here.)

@@ +924,4 @@
>     *
> +   * Once this method has been called and before FinishBulkWrite()
> +   * has been called, only calls to Data() or this method again
> +   * are valid. Do not call any other methods between calling this

Is it clearer to say here that "...has been called, the only methods on the string that are safe to call are Data() or this method."?  Should something be said about direct access to members, as well, since the code seems to assume certain members are safe to access?  Or should those pieces of code be converted to use accessors?

@@ +931,5 @@
> +   *                  will be greater than or equal to this value.
> +   * @param aPrefixToPreserve The number of code units at the start
> +   *                          of the old buffer to copy into the
> +   *                          new buffer.
> +   * @parem aAllowShrinking If true, an allocation may be performed

Using an actual enum instead of a bool here seems much better.  If this parameter was actually used from Rust, we might be able to make it a bool to avoid potential efficiency issues with Rust bool -> C enum conversion.  But given that we don't, let's make this type-safe.

Attachment #8999547 - Flags: review?(nfroyd) → review+

Henri Sivonen (:hsivonen)

Assignee

Comment 197

•

6 years ago

(In reply to Nathan Froyd [:froydnj] from comment #196)
> Comment on attachment 8999547 [details] [diff] [review]
> Patch with erahm's comments addressed and aAllowShrinking=false in
> ReplacePrepInternal
> 
> Review of attachment 8999547 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r=me with changes based on the below comments and any ensuing discussion you
> would like to have.
> 
> ::: servo/support/gecko/nsstring/src/lib.rs
> @@ +1224,5 @@
> > +    fn Gecko_StartBulkWriteCString(
> > +        this: *mut nsACString,
> > +        capacity: u32,
> > +        units_to_preserve: u32,
> > +        allow_shrinking: bool,
> 
> IIUC, this function doesn't actually accept an allow_shrinking parameter, so
> this should be removed.
> 
> @@ +1243,5 @@
> > +    fn Gecko_StartBulkWriteString(
> > +        this: *mut nsAString,
> > +        capacity: u32,
> > +        units_to_preserve: u32,
> > +        allow_shrinking: bool,
> 
> Likewise here.

Whoa! Good catch! I added the corresponding argument on the C++ side.
 
> ::: xpcom/string/nsTSubstring.cpp
> @@ +87,5 @@
> > +  // string.
> > +  if (MOZ_UNLIKELY(!aCapacity)) {
> > +    ::ReleaseData(this->mData, this->mDataFlags);
> > +    SetToEmptyBuffer();
> > +    this->mDataFlags &= ~DataFlags::VOIDED; // mutation clears voided flag
> 
> Why are we doing this when SetToEmptyBuffer only sets TERMINATED here?  Were
> there changes to SetToEmptyBuffer someplace else?

You are right that this line is unnecessary. Removed.

> @@ +163,2 @@
> >                 "should have hit the early return at the top");
> > +    // Avoid shinking if new buffer size is close to the old. Note that
> 
> Nit: "shrinking" and "the new buffer size"

Fixed.

> @@ +789,5 @@
> > +    // in a valid state.
> > +
> > +    // Otherwise, instead of calling FinishBulkWrite,
> > +    // intentionally leave the string in the weird state
> > +    // required by the legacy semantics of this method.
> 
> Can we spell out the legacy semantics somewhere, so people don't have to
> reverse-engineer them from this comment and/or the discussion in this bug? 
> Or is the intention here that the comment about "existing callers", above,
> spells out those semantics?  If so, can we make that connection explicit?

Adjusted comment. Note that the previous implementation had a comment blaming some of the weirdness on legacy compat, and I don't know what callers that part of the behavior is for...

> ::: xpcom/string/nsTSubstring.h
> @@ +901,4 @@
> >     */
> >    void NS_FASTCALL Finalize();
> >  
> > +public:
> 
> The public-ness of this method seems really unfortunate for what is a fairly
> unwieldy API.  Is it possible to keep this `protected` and instead
> judiciously use `friend` to let the Rust bindings see these functions?  Or,
> as a compromise position, should the full API of StartBulkWrite be protected
> (or even private), and a simpler (aCapacity, aPrefixToPreserve) API be
> publically exposed for everybody else, since that's what non-nsTSubstring
> users seem to want?  (I think I like the `friend` approach more, but am
> trying to think through options here.)

I made FinishBulkWrite() protected and added an all-caps comment to StartBulkWrite() saying it shouldn't be called from outside the string implementation. Using friend didn't work, because extern functions or static functions can't be declared as friends. Bug 1482828 is about adding a proper public API.

> @@ +924,4 @@
> >     *
> > +   * Once this method has been called and before FinishBulkWrite()
> > +   * has been called, only calls to Data() or this method again
> > +   * are valid. Do not call any other methods between calling this
> 
> Is it clearer to say here that "...has been called, the only methods on the
> string that are safe to call are Data() or this method."?  Should something
> be said about direct access to members, as well, since the code seems to
> assume certain members are safe to access?  Or should those pieces of code
> be converted to use accessors?

Changed Data() to mData.

> @@ +931,5 @@
> > +   *                  will be greater than or equal to this value.
> > +   * @param aPrefixToPreserve The number of code units at the start
> > +   *                          of the old buffer to copy into the
> > +   *                          new buffer.
> > +   * @parem aAllowShrinking If true, an allocation may be performed
> 
> Using an actual enum instead of a bool here seems much better.  If this
> parameter was actually used from Rust, we might be able to make it a bool to
> avoid potential efficiency issues with Rust bool -> C enum conversion.  But
> given that we don't, let's make this type-safe.

It is now actually used from Rust.

Henri Sivonen (:hsivonen)

Assignee

Comment 198

•

6 years ago

Try run after the changes:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29cd0e8957766434bb93114251914b20f710c99a

Henri Sivonen (:hsivonen)

Assignee

Comment 199

•

6 years ago

Attached patch Patch with froydnj's review comments addressed — Details — Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Comment 200

•

6 years ago

Attached patch Interdiff between the patch addressing erahm's comments and the patch addressing froydnj's comments — Details — Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Comment 201

•

6 years ago

Note to self: Once this and related bugs have landed, compare the platform microbenchmarks of the "after" situation with
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=8b39d1161075364a95bc2d1577b389411fe5c342 and
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bb02d7d281cb13805772a9875c4f31074c152c7b
as the "before" reference.

Pulsebot

Comment 202

•

6 years ago

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ef0f163fdeb
Use encoding_rs for XPCOM string encoding conversions. r=Nika,erahm,froydnj.

Andreea Pavel [:apavel]

Comment 203

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4ef0f163fdeb

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox63: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla63

Henri Sivonen (:hsivonen)

Assignee

Comment 204

•

6 years ago

Looks like tree did not burn.

Thank you Nika, erahm and froydnj for reviews!

Let's continue in follow-up bugs.

Calixte Denizet (:calixte)

Updated

•

6 years ago

Depends on: 1483603

Joel Maher ( :jmaher ) (UTC -8)

Comment 205

•

6 years ago

I see a small increase in the installer size from this bug:

== Change summary for alert #15036 (as of Wed, 15 Aug 2018 23:31:04 GMT) ==

Regressions:

  0%  installer size windows2012-64 pgo      68,340,584.00 -> 68,426,715.83

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=15036

Henri Sivonen (:hsivonen)

Assignee

Comment 206

•

6 years ago

(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #205)
> I see a small increase in the installer size from this bug:
> 
> == Change summary for alert #15036 (as of Wed, 15 Aug 2018 23:31:04 GMT) ==
> 
> Regressions:
> 
>   0%  installer size windows2012-64 pgo      68,340,584.00 -> 68,426,715.83
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=15036

Filed bug 1483775. I hope the increase is considered small enough that it's OK to wait for LLVM ThinLTO between clang and rustc-generated code to land before taking action?

Depends on: 1483775

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Depends on: 1484045

Henri Sivonen (:hsivonen)

Assignee

Comment 207

•

6 years ago

Documenting the outcome now that the follow-up Rust patches have landed:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=8b39d1161075364a95bc2d1577b389411fe5c342&newProject=mozilla-central&newRevision=6cc53c0f6efec1c590a5b3a6c0019a26e11e6843&framework=6&filter=UTF

linux32 perf looks like it mostly regressed and badly. No clue why. Since linux32 is on the verge of not being tier1 and windows7-32 generally got better, I'm not too worried about the very weird linux32 numbers and I think the linux32 case is not worth investigating. Certainly not worth a backout.

The differences between linux64 and linux64-qr suggest the results are very noisy.

In some cases, the function call overhead got worse, which regressed very short strings (one or three in length), but in those cases the absolute time is tiny anyway.

In general, this was an improvement and the improvement tends to be in double-digit percentages, often around -30%, sometimes better than -50% and in the best case -90%.

Overall, seems like a worthwhile change.

Henri Sivonen (:hsivonen)

Assignee

Comment 208

•

6 years ago

Less noisy view that has the flaw that the early parts of the patch series are part of the historical data from the point of view of the last commit of the series:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=mozilla-central&newRevision=6cc53c0f6efec1c590a5b3a6c0019a26e11e6843&framework=6&filter=UTF&selectedTimeRange=7776000

The conclusions hold.

Henri Sivonen (:hsivonen)

Assignee

Comment 209

•

6 years ago

FWIW, the call overhead change seems particularly strong on Mac.

Henri Sivonen (:hsivonen)

Assignee

Comment 210

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #207)
> I think the linux32 case is not worth investigating.

It's still weird enough that I filed bug 1485557.

Henri Sivonen (:hsivonen)

Assignee

Comment 211

•

6 years ago

It turns out that the linux32 test run that I looked at previously was an unlucky outlier (way worse than the usual cluster of data points).

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=mozilla-central&newRevision=338526a05458&framework=6&filter=UTF&selectedTimeRange=7776000 looks OK for linux32, too.

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Keywords: perf

David Burns :automatedtester

Updated

•

6 years ago

Blocks: 1495753

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

Depends on: 1519337

Henri Sivonen (:hsivonen)

Assignee

Updated

•

4 years ago

Regressions: 1605063

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: String → XPCOM

Bug 1402247 - Use encoding_rs for XPCOM string encoding conversions. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request	nika : review+ erahm : review-	Details
Bug 1402247 part 2 - Implement converting append/assign for XPCOM strings in Rust. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 3 - Call Rust-based append/assign code from nsReadableUtils. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 4 - Use encoding_rs::mem for string conversion that yield plain C strings. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 5 - Remove const char* / const char16_t versions of string encoding conversions. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 6 - Fix fatal bugs. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 7 - Use the new code from XPConnect. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 8 - Remove LossyConvertEncoding8to16 and LossyConvertEncoding16to8. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 9 - Remove copy_string(). 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 7b - Pass the right length to allocator. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 9b - Fix fatal bugs. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 10 - Make UTF8CharEnumerator::NextChar() consistent with UTF-8 to UTF-16 string conversion. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 part 11 - Use U+FFFD when atomizing invalid UTF-8. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 addendum - Unconditionally use unaligned loads and stores and unroll the loop by two when converting between Latin1 and UTF-16. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 bis - Rework string buffer allocation from Rust. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request		Details
Bug 1402247 - Big scroll-past parts. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request	nika : review+	Details
Bug 1402247 - Simple caller updates. 6 years ago Henri Sivonen (:hsivonen) 59 bytes, text/x-review-board-request	nika : review+	Details
erahm's comments addressed (except for debug asserting Latin1ness) 6 years ago Henri Sivonen (:hsivonen) 189.17 KB, patch		Details \| Diff \| Splinter Review
Patch with erahm's comments addressed 6 years ago Henri Sivonen (:hsivonen) 191.56 KB, patch		Details \| Diff \| Splinter Review
Interdiff between the last MozReview patch and the Splinter patch 6 years ago Henri Sivonen (:hsivonen) 14.53 KB, patch		Details \| Diff \| Splinter Review
Patch with erahm's comments addressed, with less SRI brokenness 6 years ago Henri Sivonen (:hsivonen) 191.84 KB, patch	erahm : review+	Details \| Diff \| Splinter Review
Interdiff between the last MozReview patch and the Splinter patch (v2) 6 years ago Henri Sivonen (:hsivonen) 14.82 KB, patch		Details \| Diff \| Splinter Review
Patch with even more erahm's comments addressed 6 years ago Henri Sivonen (:hsivonen) 191.92 KB, patch		Details \| Diff \| Splinter Review
Interdiff between the last MozReview patch and the Splinter patch (v3) 6 years ago Henri Sivonen (:hsivonen) 16.89 KB, patch		Details \| Diff \| Splinter Review
Patch with erahm's comments addressed and aAllowShrinking=false in ReplacePrepInternal 6 years ago Henri Sivonen (:hsivonen) 191.92 KB, patch	froydnj : review+	Details \| Diff \| Splinter Review
Interdiff between the last MozReview patch and the Splinter patch (v4) 6 years ago Henri Sivonen (:hsivonen) 16.97 KB, patch		Details \| Diff \| Splinter Review
Patch with froydnj's review comments addressed 6 years ago Henri Sivonen (:hsivonen) 192.45 KB, patch		Details \| Diff \| Splinter Review
Interdiff between the patch addressing erahm's comments and the patch addressing froydnj's comments 6 years ago Henri Sivonen (:hsivonen) 4.54 KB, patch		Details \| Diff \| Splinter Review