Closed Bug 1261841 (encoding_rs) Opened 8 years ago Closed 7 years ago

Replace uconv with encoding_rs

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: addon-compat)

Attachments

(6 files, 3 obsolete files)

4 months ago, I proposed rewriting uconv in Rust:
https://groups.google.com/d/msg/mozilla.dev.platform/sefrg5Of8tw/_WK7Vtk9AAAJ

There were some concerns about the Rust dependency in general but no objection to replacing uconv specifically. The Rust dependency is happening for other reasons anyway (starting with the MP4 parser).

I have started an implementation of the Encoding Standard in Rust at https://github.com/hsivonen/encoding-rs . (I believe that rust-encoding isn't suitable for Gecko for the reasons given in https://docs.google.com/document/d/13GCbdvKi83a77ZcKOxaEteXp1SOGZ_9Fmztb9iX22v0/edit# .)

While encoding-rs isn't complete yet (it's missing a bunch of encodings as well as performance optimizations), I'm now confident that once it is complete, it makes sense to replace uconv with it.

C++ code in Gecko would see an API along the lines of https://github.com/hsivonen/encoding-rs/blob/master/include/encoding_rs_cpp.h (with std and gsl things replaced with XPCOM and MFBT analogs).

I will mark some existing bugs as depending on this one to indicate that replacing uconv with encoding-rs would fix those bugs or make them moot. This does not imply that those bugs couldn't be fixed by other means sooner.
Blocks: 336553, 504831
Depends on: 1163224
This should also depend on a bug that is probably not-yet-filed which should be called "Require Rust to build Firefox".
(In reply to Ted Mielczarek [:ted.mielczarek] (PTO July 9th-12th) from comment #1)
> This should also depend on a bug that is probably not-yet-filed which should
> be called "Require Rust to build Firefox".

Indeed. Dependency added.

Additionally, it's now called encoding_rs (as opposed to encoding-rs) to have the name be a valid crate name.

encoding_rs now has implementations for all the encodings and the API is very likely not to undergo incompatible changes. However, the performance optimizations, including the SSE2 one needed for optimization parity with our current UTF-8 decoder, haven't been implemented yet.
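
For illustration, the kind of ASCII fast path that such SIMD work accelerates can be sketched in scalar Rust. The helper name is hypothetical and encoding_rs's real implementation is more involved; this only shows the idea of checking many bytes per iteration:

```rust
use std::convert::TryInto;

// Scalar sketch of an ASCII fast path: scan 8 bytes at a time and bail
// out as soon as any byte has its high bit set (i.e. is non-ASCII).
// An SSE2 or NEON version would do the same check 16 bytes at a time.
fn is_ascii_chunked(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(8);
    for chunk in chunks.by_ref() {
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        if word & 0x8080_8080_8080_8080 != 0 {
            return false;
        }
    }
    // Handle the sub-8-byte tail one byte at a time.
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```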
Depends on: 1284816
Summary: Replace uconv with encoding-rs → Replace uconv with encoding_rs
Alias: encoding_rs
Blocks: 1285398
Blocks: 1285396
Blocks: 1285399
Blocks: 1285400
Blocks: 1071816
Depends on: 1295611
Depends on: oxidation
Depends on: 1331859
No longer blocks: 1345207
Depends on: 1348272
Attached patch Backup of WIP (does not compile) (obsolete) — Splinter Review
Blocks: 1215860
Depends on: 943287
Attachment #8848489 - Attachment is obsolete: true
Depends on: 1353324
Depends on: 1354994
Depends on: 1359353
Attached patch Newer backup of WIP (obsolete) — Splinter Review
Attachment #8849989 - Attachment is obsolete: true
Depends on: 1359874
Attachment #8862360 - Attachment is obsolete: true
Depends on: 1360497
May I ask what the plan is here? Are you going to land all three parts at the same time, or would it be possible to land the uconv removal, say, a week later so the TB team has the chance to switch over to the new interface without any "bustage fix" pressure? It's a little hard to switch over to a new interface before it lands and then be forced to switch over in no time. Part 3 looks like it consists of straight removals, so if I read it correctly, there would be no problem with landing it a little later.
Blocks: 747762
(In reply to Jorg K (GMT+2) from comment #36)
> May I ask what the plan is here? Are you going to land all three parts at
> the same time

That's my intention.

> or would it be possible to land the uconv removal - say - a
> week later so the TB team have the chance to switch over to the new
> interface without any "bustage fix" pressure. It's a little hard to switch
> over to a new interface before it lands and then being forced to switch over
> in no time. Looks like Part 3 consists of straight removals, so if I read it
> correctly, there would not be a problem with landing it a little later.

Part three is just file removals, but uconv is made not to work in part 2 already. I just moved the full-file removals to a different changeset to make the most interesting changeset (the middle one) more reviewable.

It's worth noting that: 1) the uconv code needs to be removed from libxul at the same time the encoding_rs code goes in, because otherwise people watching Firefox (and especially Fennec) code size will sound an alert, and 2) the changes to e.g. EncodingUtils would already break c-c even if uconv itself wasn't removed atomically.

As for preparing a c-c patch with more time than "no time", at some point the patch evolution here will slow down, so you could apply the patches to a local copy of m-c-as-used-by-c-c and develop the c-c patch.
Blocks: 1363281
(In reply to Henri Sivonen (:hsivonen) from comment #37)
> As for preparing a c-c patch with more time than "no time", at some point
> the patch evolution here will slow down, so you could apply the patches to a
> local copy of m-c-as-used-by-c-c and develop the c-c patch.
I tried making a start, but neither part 1 nor part 2 apply cleanly to current M-C trunk. While part 1 only complains about "Cargo.lock" files (which I don't think matter), part 2 has a lot of rejected hunks.

So Henri, please let me know when is a good time to start. I've filed bug 1363281 for the mailnews work.
(In reply to Jorg K (GMT+2) from comment #50)
> I tried making a start, but neither part 1 nor part 2 apply cleanly to
> current M-C trunk. While part 1 only complains about "Cargo.lock" files
> (which I don't think matter),

They do matter.

> part 2 has a lot of rejected hunks.

I intend to rebase after I get a green try run with my current reference point. I have refrained from rebasing in order to reduce the noise in orange reduction progress.

> So Henri, please let me know when is a good time to start.

I will.
emk, do you have opinions on who should review and how review should be timed relative to the remaining bits still needed? Are you comfortable reviewing Rust code?

I consider the architecture and API of encoding_rs done (part 1) and the integration between encoding_rs and Gecko done except for the build system and header location bits below (part 2). The removal of uconv (part 3) is done and trivial.

The list of the remaining work before landing is:

 * Properly connect build system configure flags to cargo features.

 * Gain agreement on dev-platform to make the SIMD cargo feature enablement configure flag also unlock nightly features in the Rust compiler, and to use that flag in mozconfigs for official Firefox builds and CI. (I.e., we'd use a release-channel Rust compiler that has been through the Rust release regression-fix process but also use one feature that Rust doesn't consider "done" enough for release-channel Rust.)

 * Upstream the copies of encoding_rs C headers from under intl/ to the encoding_c crate and use them directly from under third_party/rust/encoding_c/include/.

 * Make UTF-16 to UTF-8 encode of non-Latin content go faster on x86_64 (probably by moving the surrogate presence check to SSE2).

 * Make UTF-8 to UTF-16 decode and UTF-16 to UTF-8 encode go faster on ARMv7 (probably by accelerating ASCII handling [both directions] and surrogate checks [encoder only] using NEON).
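
To illustrate the surrogate presence check mentioned in the two items above (the function name is made up; the planned SSE2/NEON code would apply the same bitmask to many lanes at once):

```rust
// A UTF-16 code unit is a surrogate iff it lies in 0xD800..=0xDFFF,
// which the mask test (unit & 0xF800) == 0xD800 captures exactly.
fn contains_surrogates(units: &[u16]) -> bool {
    units.iter().any(|&u| u & 0xF800 == 0xD800)
}
```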

Stuff that could happen after landing:

 * Benchmark which uses of "unsafe" added for presumed performance gains in encoding_rs actually pay off, and remove "unsafe" where it doesn't help.

 * Make UTF-16 decode faster using SSE2. (Now not optimized at all.)

 * Make x-user-defined decode faster using SSE2.

 * Update UTF-8 validation with a couple of changes made to the Rust standard library since copying code from there. (The reason for not just calling into the Rust standard library is the addition of SIMD acceleration. I hope it can be upstreamed some day.)

 * Recalibrate multi-threaded UTF-8 validation and decide if it's worthwhile to enable or if the code should be deleted.
Flags: needinfo?(VYV03354)
(In reply to Jorg K (GMT+2) from comment #50)
> So Henri, please let me know when is a good time to start. I've filed bug
> 1363281 for the mailnews work.

From my perspective, the API is now ready. Review may show otherwise, of course, so it's up to you if you want to take the risk of review resulting in API changes at this point.
> Stuff that could happen after landing:

Also, the new API allows for better BOM and EOF handling than the old API, but part 2 here doesn't try to make all uses of decoders in the tree handle BOMs and EOF better than the old code did. I left that kind of incremental correctness improvement to follow-ups.
(In reply to Henri Sivonen (:hsivonen) from comment #82)
> emk, do you have opinions on who should review and how review should be
> timed relative to the remaining bits still needed? Are you comfortable
> reviewing Rust code?

No. Maybe Xidorn can review Rust code?

I'll review the integration part (part 2).
Flags: needinfo?(VYV03354) → needinfo?(xidorn+moz)
Looks like the Rust code includes lots of interacting with nsstring, so maybe mystor could be a better reviewer here.

A general comment that, we probably want something like the build script of mp4parse binding to generate the C++ header file from the Rust code, rather than maintaining them separately.
Flags: needinfo?(xidorn+moz) → needinfo?(michael)
(In reply to Masatoshi Kimura [:emk] from comment #85)
> I'll review the integration part (part 2).

Thanks.

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #86)
> Looks like the Rust code includes lots of interacting with nsstring, so maybe mystor could be a better reviewer here.

I agree that it would be good if mystor reviewed the encoding_glue crate (in part 2).

What about the encoding_rs crate itself (part 1)? It's probably nicer to see the bulk of part 1 at https://github.com/hsivonen/encoding_rs than in the patch form.

> A general comment that, we probably want something like the build script of
> mp4parse binding to generate the C++ header file from the Rust code, rather
> than maintaining them separately.

Using cheddar from a build.rs makes the development experience miserable, because building cheddar's dependencies is so slow. Additionally, per previous dev-platform discussion, we don't have a good way to run cheddar before the C++ build needs to see the headers.

For this reason, while the header for the encoding_c crate is cheddar-generated, the header generation is run manually and the result checked in.

For encoding_glue, I figured that it was so easy to write the header manually that I didn't bother to consider how to deal with cheddar for a crate that's never built in isolation but only as part of the gkrust dependency graph.
(In reply to Henri Sivonen (:hsivonen) from comment #87)
> What about the encoding_rs crate itself (part 1)? It's probably nicer to see
> the bulk of part 1 at https://github.com/hsivonen/encoding_rs than in the
> patch form.

Oh, you mean that crate... That would really need nontrivial time, and expertise on encodings, to review... I'm not sure who else has expertise on both encodings and Rust... Probably someone from the Servo team?
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #88)
> (In reply to Henri Sivonen (:hsivonen) from comment #87)
> > What about the encoding_rs crate itself (part 1)? It's probably nicer to see
> > the bulk of part 1 at https://github.com/hsivonen/encoding_rs than in the
> > patch form.
> 
> Oh you mean that crate... That would really need a nontrivial time, and
> expertise on encoding to review... I'm not sure who else has expertise on
> both encoding and Rust... Probably someone from Servo team?

Simon, would you be OK with reviewing the encoding_rs crate itself (at least to the point of rubber-stamping it for vendoring into m-c)?
Flags: needinfo?(simon.sapin)
> Simon, would you be OK with reviewing the encoding_rs crate itself (at least
> to the point of rubber-stamping it for vendoring into m-c)?

FWIW, encoding_rs is already clippy-clean (except for the exclusions seen in the source) and has already been cargo-fuzzed. (Checking decoder output for UTF-8 or UTF-16 validity and checking that API promises made about buffer sizes hold. All the fuzzer-found bugs were in the latter category.)
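
For reference, the UTF-16 validity invariant being checked (no unpaired surrogates in decoder output) can be sketched like this; the function name is hypothetical and this is not the fuzzer's actual code:

```rust
// Valid UTF-16 means every high surrogate (0xD800..=0xDBFF) is
// immediately followed by a low surrogate (0xDC00..=0xDFFF), and no
// low surrogate appears on its own.
fn is_valid_utf16(units: &[u16]) -> bool {
    let mut i = 0;
    while i < units.len() {
        let u = units[i];
        if u & 0xFC00 == 0xD800 {
            // High surrogate must be followed by a low surrogate.
            match units.get(i + 1) {
                Some(&next) if next & 0xFC00 == 0xDC00 => i += 2,
                _ => return false,
            }
        } else if u & 0xFC00 == 0xDC00 {
            // Unpaired low surrogate.
            return false;
        } else {
            i += 1;
        }
    }
    true
}
```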
Ted, what am I doing wrong in https://hg.mozilla.org/try/diff/54c7dfc6133b/toolkit/moz.configure that makes the "if" conditions I added to https://hg.mozilla.org/try/diff/54c7dfc6133b/toolkit/library/rust/gkrust-features.mozbuild always evaluate to true (if uncommented)?
Flags: needinfo?(ted)
(In reply to Henri Sivonen (:hsivonen) from comment #83)
> From my perspective, the API is now ready. Review may show otherwise, of
> course, so it's up to you if you want to take the risk of review resulting
> in API changes at this point.
Part 2 doesn't apply to current trunk at the time of writing. Against which rev can I apply this? Or do you want to keep rebasing this? I get a failed hunk in intl/moz.build.
I’m happy to rubber-stamp encoding_rs for inclusion in m-c. I’ve read some of its code, and written some code using it. I like the general design. If a more detailed review through every line of code is desired, I can start on that after I get back from PTO on 2017-06-05. However, for what it’s worth, many crates have already been vendored in m-c as dependencies of Stylo or Webrender with much less review than that.
Flags: needinfo?(simon.sapin)
(In reply to Simon Sapin (:SimonSapin) from comment #93)
> I’m happy to rubber-stamp encoding_rs for inclusion in m-c. I’ve read some
> of its code, and written some code using it. I like the general design.

Nice. Thank you.

(In reply to Jorg K (GMT+2) from comment #92)
> Part 2 doesn't apply to current trunk at time of writing. Against which rev
> can I apply this? Or do you want to keep rebasing this. I get a failed hunk
> in intl/moz.build.

The base revision is https://hg.mozilla.org/mozilla-central/rev/fe8b783001368ba8e915884d7eb5f45ab7cb762f
(In reply to Henri Sivonen (:hsivonen) from comment #94)
> (In reply to Jorg K (GMT+2) from comment #92)
> > Part 2 doesn't apply to current trunk at time of writing. Against which rev
> > can I apply this? Or do you want to keep rebasing this. I get a failed hunk
> > in intl/moz.build.
> 
> The base revision is
> https://hg.mozilla.org/mozilla-central/rev/
> fe8b783001368ba8e915884d7eb5f45ab7cb762f

The lack of https://hg.mozilla.org/try/rev/dc50a64d1aed in your tree may cause the conflict.

Also, it's good to import https://hg.mozilla.org/try/rev/37370702f29f to avoid undefined behavior. (I need to fix bug 1359874 more properly before the patches here can land.)
Hmm, thanks for letting me know. Still seems to be a bit of shifting ground ;-( I will need to set up a different development machine for this, since compiling takes me about 80 minutes, and if I have to work on something else with M-C at tip, it's another compile.
(In reply to Henri Sivonen (:hsivonen) from comment #87)
> Using cheddar from a build.rs makes the development experience miserable,
> because building cheddar's dependencies is so slow. Additionally, per
> previous dev-platform discussion, we don't have a good way to run cheddar
> before the C++ build needs to see the headers.

IIRC Webrender is using https://github.com/jrmuizel/wr-binding/ which is a full-rust bindings generator written specifically for webrender (Jeff will correct me if I'm wrong, I'm sure). That should be much nicer to use than cheddar because it doesn't require c++ dependencies.

In addition, I believe that the situation for building & running rust code before headers has changed now, but I might be wrong.
Flags: needinfo?(jmuizelaar)
(In reply to Michael Layzell [:mystor] from comment #97)
> (In reply to Henri Sivonen (:hsivonen) from comment #87)
> > Using cheddar from a build.rs makes the development experience miserable,
> > because building cheddar's dependencies is so slow. Additionally, per
> > previous dev-platform discussion, we don't have a good way to run cheddar
> > before the C++ build needs to see the headers.
> 
> IIRC Webrender is using https://github.com/jrmuizel/wr-binding/ which is a
> full-rust bindings generator written specifically for webrender (Jeff will
> correct me if I'm wrong, I'm sure). That should be much nicer to use than
> cheddar because it doesn't require c++ dependencies.
> 
> In addition, I believe that the situation for building & running rust code
> before headers has changed now, but I might be wrong.

The m-c-specific code at issue is this manually-written 'extern "C"' block for exposing the encoding_glue crate (the nsstring-aware wrapper for encoding_rs) to C++:
https://hg.mozilla.org/try/file/54c7dfc6133b/intl/Encoding.h#l31

I don't expect those declarations to change often (if at all) in the future, so I think we shouldn't over-engineer their generation.

The C header for the encoding_c crate (i.e. the non-Gecko-specific FFI wrapper for encoding_rs) is not an m-c-specific code generation concern but a matter for a crates.io crate (at this time, a cheddar-generated header that's distributed via crates.io instead of being generated by the recipient).
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review146448

I only looked at intl/encoding_glue/src/lib.rs - as AFAIK that is the only component which uses the rust `nsstring` API. I didn't see any incorrect uses of the API off of the top of my head (assuming that it is safe to pass an uninitialized output buffer to `encoding_rs` - which I presume it is).

My biggest complaint is just that there is a lot of repetition in unsafe code. This seems unfortunate and error prone to me.

Perhaps we would want to add some methods to our rust nsA[C]String bindings to make this nicer, like `nsA[C]String.set_fallible_usize_length()` which would take a `usize` and check that the length doesn't exceed `u32`? Not sure if that's worth doing, but I imagine that lots of rust bindings will want to do things like that.

::: intl/encoding_glue/src/lib.rs:60
(Diff revision 25)
> +}
> +
> +pub fn decode_to_nsstring_without_bom_handling(encoding: &'static Encoding, src: &[u8], dst: &mut nsAString) -> nsresult {
> +	let mut decoder = encoding.new_decoder_without_bom_handling();
> +    let needed = match decoder.max_utf16_buffer_length(src.len()) {
> +    	Some(max) => {

NIT: This indentation, among others, is actually a tab instead of 4 spaces. Please use 4 spaces for indentation in rust files.

You can use rustfmt to enforce this automatically.

::: intl/encoding_glue/src/lib.rs:99
(Diff revision 25)
> +    let needed = match decoder.max_utf16_buffer_length(src.len()) {
> +        Some(max) => {
> +            // XPCOM strings use uint32_t for length.
> +            if max > ::std::u32::MAX as usize {
> +                return NS_ERROR_OUT_OF_MEMORY;
> +            }
> +            max as u32
> +        }
> +        None => {
> +            return NS_ERROR_OUT_OF_MEMORY;
> +        }
> +    };

It would be nice to refactor this truncating behavior out. Perhaps using a macro like `try_u32_len!(decoder.max_utf16_buffer_length(src.len()), NS_ERROR_OUT_OF_MEMORY)` which would take `Option<usize>` and produce `u32`, early returning a `NS_ERROR_OUT_OF_MEMORY` if the length doesn't fit in a u32 or is `None`.

Perhaps this macro should be unsafe and also handle setting the length of the destination nsAString?
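
A minimal sketch of the suggested macro, assuming a plain `Result` stands in for `nsresult`/`NS_ERROR_OUT_OF_MEMORY` (names and shape hypothetical, not the patch's actual code):

```rust
// Hypothetical try_u32_len! along the lines suggested above: turn an
// Option<usize> buffer-length estimate into a u32, early-returning the
// given error when the estimate is absent or exceeds u32::MAX.
macro_rules! try_u32_len {
    ($opt:expr, $err:expr) => {
        match $opt {
            // XPCOM strings use uint32_t for length, so reject larger values.
            Some(len) if len <= u32::MAX as usize => len as u32,
            _ => return $err,
        }
    };
}

// Example caller; in Gecko the error would be NS_ERROR_OUT_OF_MEMORY.
fn checked_len(maybe: Option<usize>) -> Result<u32, ()> {
    let len = try_u32_len!(maybe, Err(()));
    Ok(len)
}
```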

::: intl/encoding_glue/src/lib.rs:139
(Diff revision 25)
> +    }
> +}
> +
> +#[no_mangle]
> +pub unsafe extern "C" fn mozilla_encoding_encode_from_utf16(encoding: *mut *const Encoding, src: *const u16, src_len: usize, dst: *mut nsACString) -> nsresult {
> +    let (rv, enc) = encode_from_utf16(&**encoding, ::std::slice::from_raw_parts(src, src_len), &mut *dst);

Why are you using `::std::slice::from_raw_parts` instead of `use std::slice;` at the top and then `slice::from_raw_parts`?

::: intl/encoding_glue/src/lib.rs:147
(Diff revision 25)
> +    if let Some(needed) = encoder.max_buffer_length_from_utf16_if_no_unmappables(src.len()) {
> +        // XPCOM strings use uint32_t for length.
> +        if needed > ::std::u32::MAX as usize {
> +            return (NS_ERROR_OUT_OF_MEMORY, output_encoding);
> +        }
> +        unsafe {
> +            if dst.fallible_set_length(needed as u32).is_err() {
> +                return (NS_ERROR_OUT_OF_MEMORY, output_encoding);
> +            }
> +        }
> +    } else {
> +        return (NS_ERROR_OUT_OF_MEMORY, output_encoding);
> +    }

Please use a consistent formatting for the `fallible_set_length` function. It would be nice to be able to visually scan the file and easily tell that they're all doing the same thing.

::: intl/encoding_glue/src/lib.rs:375
(Diff revision 25)
> +                             decoder.max_utf8_buffer_length_without_replacement(bytes.len() -
> +                                                                                valid_up_to)) {

nit: formatting

::: intl/encoding_glue/src/lib.rs:515
(Diff revision 25)
> +                    if let Some(with_bookkeeping) = 9usize.checked_add(needed) {
> +                        let rounded = with_bookkeeping.next_power_of_two();
> +                        let unclowned = rounded - 9usize;

Perhaps pull these two instances of `9usize` into a `const NS_STRING_BUFFER_OVERHEAD: usize = 9;`?
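
The allocation-rounding logic in the quoted hunk, with the suggested constant pulled out, amounts to the following sketch (helper name hypothetical):

```rust
// Bookkeeping bytes that an XPCOM string buffer adds to the payload;
// the value 9 comes from the quoted hunk, the constant name from the
// review suggestion above.
const NS_STRING_BUFFER_OVERHEAD: usize = 9;

// Round the total allocation up to a power of two, then report how much
// payload capacity that leaves. None signals arithmetic overflow.
fn rounded_capacity(needed: usize) -> Option<usize> {
    let with_bookkeeping = NS_STRING_BUFFER_OVERHEAD.checked_add(needed)?;
    let rounded = with_bookkeeping.next_power_of_two();
    Some(rounded - NS_STRING_BUFFER_OVERHEAD)
}
```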
Attachment #8862368 - Flags: review+
(In reply to Michael Layzell [:mystor] from comment #97)
> (In reply to Henri Sivonen (:hsivonen) from comment #87)
> > Using cheddar from a build.rs makes the development experience miserable,
> > because building cheddar's dependencies is so slow. Additionally, per
> > previous dev-platform discussion, we don't have a good way to run cheddar
> > before the C++ build needs to see the headers.
> 
> IIRC Webrender is using https://github.com/jrmuizel/wr-binding/ which is a
> full-rust bindings generator written specifically for webrender (Jeff will
> correct me if I'm wrong, I'm sure). That should be much nicer to use than
> cheddar because it doesn't require c++ dependencies.

https://github.com/jrmuizel/wr-binding/ has been replaced with https://github.com/rlhunt/cbindgen. cbindgen supports projects other than webrender.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(michael)
Something is not right about the SIMD vs. no-SIMD builds here. They don't show the expected performance difference.
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review146448

> NIT: This indentation, among others, is actually a tab instead of 4 spaces. Please use 4 spaces for indentation in rust files.
> 
> You can use rustfmt to enforce this automatically.

Ran rustfmt and reconfigured editor defaults.

> It would be nice to refactor this truncating behavior out. Perhaps using a macro like `try_u32_len!(decoder.max_utf16_buffer_length(src.len()), NS_ERROR_OUT_OF_MEMORY)` which would take `Option<usize>` and produce `u32`, early returning a `NS_ERROR_OUT_OF_MEMORY` if the length doesn't fit in a u32 or is `None`.
> 
> Perhaps this macro should be unsafe and also handle setting the length of the destination nsAString?

Factored out into a macro.

> Why are you using `::std::slice::from_raw_parts` instead of `use std::slice;` at the top and then `slice::from_raw_parts`?

Using the fully-qualified name seemed simpler than `use`. Switched to `use`.

> Please use a consistent formatting for the `fallible_set_length` function. It would be nice to be able to visually scan the file and easily tell that they're all doing the same thing.

Used the same macro here.

> nit: formatting

Ran rustfmt.

> Perhaps pull these two instances of `9usize` into a `const NS_STRING_BUFFER_OVERHEAD: usize = 9;`?

Hoisted to a constant. While at it, made the other instance use the same (correct) constant; it previously had a wrong number due to a thought-process error.
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review146448

> I only looked at intl/encoding_glue/src/lib.rs - as AFAIK that is the only component which uses the rust nsstring API.

Correct.

> I didn't see any incorrect uses of the API off of the top of my head 

Thanks.

> (assuming that it is safe to pass an uninitialized output buffer to encoding_rs - which I presume it is).

It is safe (with `&mut [u8]` and `&mut [u16]` buffers; `&mut str`, which isn't used here, has to live up to its type system invariants).

> My biggest complaint is just that there is a lot of repetition in unsafe code. This seems unfortunate and error prone to me.

This was somewhat mitigated by the new macro.

> Perhaps we would want to add some methods to our rust nsA[C]String bindings to make this nicer, like nsA[C]String.set_fallible_usize_length() which would take a usize and check that the length doesn't exceed u32? Not sure if that's worth doing, but I imagine that lots of rust bindings will want to do things like that.

I think the new macro makes this need less pressing, so not pursuing as part of this patch.
Attachment #8862368 - Flags: review?(VYV03354)
Attachment #8862369 - Flags: review?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #82)
> The list of the remaining work before landing is:
> 
>  * Properly connect build system configure flags to cargo features.

Deferred to an upcoming part 4 to keep this part reviewable.

>  * Upstream the copies of encoding_rs C headers from under intl/ to the
> encoding_c crate and use them directly from under
> third_party/rust/encoding_c/include/.

Done.

(In reply to Jorg K (GMT+2) from comment #96)
> Hmm, thanks for letting me know. Still seems to be a bit of a shifting
> ground ;-( - I will need to set up a different development machine for this,
> since compiling takes me about 80 minutes and if I have to work on something
> else with M-C at tip, it's another compile.

I've rebased onto https://hg.mozilla.org/integration/mozilla-inbound/rev/4340666eb8e0 .

There's now one fewer changeset for you to import. From top to bottom:
https://hg.mozilla.org/try/rev/dfc09ea41258
https://hg.mozilla.org/try/rev/6767a326ba28
https://hg.mozilla.org/try/rev/6f0f45377c55
https://hg.mozilla.org/try/rev/37d4d9ab562d
(In reply to Henri Sivonen (:hsivonen) from comment #91)
> Ted, what am I doing wrong in
> https://hg.mozilla.org/try/diff/54c7dfc6133b/toolkit/moz.configure that
> makes the "if" conditions I added to
> https://hg.mozilla.org/try/diff/54c7dfc6133b/toolkit/library/rust/gkrust-
> features.mozbuild always evaluate to true (if uncommented)?

I don't know why the above didn't work, but my latest variant of the same stuff works. (Not uploaded yet.)
Flags: needinfo?(ted)
(In reply to Henri Sivonen (:hsivonen) from comment #110)
> There's now one fewer changesets for you to import. From top to bottom:
> https://hg.mozilla.org/try/rev/dfc09ea41258
> https://hg.mozilla.org/try/rev/6767a326ba28
> https://hg.mozilla.org/try/rev/6f0f45377c55
> https://hg.mozilla.org/try/rev/37d4d9ab562d
Thanks, I've just completed setting up another machine for this work and I managed to import those (in reverse order).
Comment on attachment 8862367 [details]
Bug 1261841 part 1 - Vendor encoding_rs and encoding_c into m-c. rs=emk,SimonSapin.

https://reviewboard.mozilla.org/r/134308/#review147692

rs=me (I didn't actually look into this. I just trust Henri.)
Attachment #8862367 - Flags: review+
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review147696

::: docshell/base/nsDocShell.cpp:11628
(Diff revision 27)
>        nsCOMPtr<nsITextToSubURI> textToSubURI =
>          do_GetService(NS_ITEXTTOSUBURI_CONTRACTID, &rv);
>        NS_ENSURE_SUCCESS(rv, rv);
>  
>        // Unescape and convert to unicode
>        nsXPIDLString uStr;

nsXPIDLString is no longer necessary because getter_Copies is no longer used.

::: dom/base/BodyUtil.cpp:492
(Diff revision 27)
>  BodyUtil::ConsumeText(uint32_t aInputLength, uint8_t* aInput,
>                         nsString& aText)
>  {
> -  StreamDecoder decoder;
> -  nsresult rv = decoder.AppendText(reinterpret_cast<char*>(aInput),
> -                                   aInputLength);
> +  nsresult rv =
> +    UTF_8_ENCODING->DecodeWithBOMRemoval(MakeSpan(aInput, aInputLength), aText);
> +  if (NS_SUCCEEDED(rv)) {

nit: please check failure instead of success as the old code did.

::: dom/base/EventSource.cpp:725
(Diff revision 27)
>      return;
>    }
> -  int32_t srcCount, outCount;
> -  char16_t out[2];
> -  const char* p = aBuffer;
> -  const char* end = aBuffer + aLength;
> +  char16_t buffer[1024];
> +  auto dst = MakeSpan(buffer);
> +  auto src = AsBytes(MakeSpan(aBuffer, aLength));
> +  // XXX no EOF handling

Is this an existing issue?

::: dom/base/nsContentUtils.h:622
(Diff revision 27)
>                                              nsIURI* aBaseURI);
>  
>    /**
>     * Convert aInput (in encoding aEncoding) to UTF16 in aOutput.
>     *
> +   * @deprecated Use mozilla::Encoding::DecodeWithBOMRemoval() in new code.

Why couldn't you remove this function? Is it a follow-up?

::: dom/base/nsContentUtils.h:635
(Diff revision 27)
>  
>    /**
>     * Determine whether a buffer begins with a BOM for UTF-8, UTF-16LE,
>     * UTF-16BE
>     *
> +   * @deprecated Use mozilla::Encoding::ForBOM() in new code.

ditto.

::: dom/base/nsContentUtils.cpp:4498
(Diff revision 27)
> -  bool found = true;
> -  aCharset.Truncate();
> -  if (aLength >= 3 &&
> -      aBuffer[0] == 0xEF &&
> -      aBuffer[1] == 0xBB &&
> -      aBuffer[2] == 0xBF) {
> +  auto span = MakeSpan(reinterpret_cast<const uint8_t*>(aBuffer), aLength);
> +  const Encoding* encoding;
> +  size_t bomLength;
> +  Tie(encoding, bomLength) = Encoding::ForBOM(span);
> +  Unused << bomLength;
> +  if (encoding) {

nit: failure case first.

::: dom/base/nsXHTMLContentSerializer.cpp:186
(Diff revision 27)
>    }
>  
>    int32_t start = 0;
>    int32_t end;
>    nsAutoString part;
>    nsXPIDLCString escapedURI;

nsXPIDLCString is no longer necessary because getter_Copies is no longer used.

::: dom/base/test/unit/test_xml_serializer.js:288
(Diff revision 27)
> -  var str2 = '<?xml version="1.0" encoding="UTF8"?>'+LB+'<root/>';
> +  var str2 = '<?xml version="1.0" encoding="UTF-8"?>'+LB+'<root/>';
>    var doc1 = ParseXML(str1);
>    var doc2 = ParseXML(str2);
>  
>    var p = Pipe();
> -  DOMSerializer().serializeToStream(doc1, p.outputStream, "ISO-8859-1");
> +  DOMSerializer().serializeToStream(doc1, p.outputStream, "windows-1252");

serializeToStream will not accept non-Encoding Standard labels anymore? Is it possible before Firefox 57?

::: dom/encoding/EncodingUtils.h:28
(Diff revision 27)
>     * Given a label, this function returns the corresponding encoding or a
>     * false.
>     * The returned name may not be lowercased due to compatibility with
>     * our internal implementations.
>     *
> +   * @deprecated Use mozilla::Encoding::ForLabel() in new code.

Again, is it a follow-up to remove them?

::: dom/encoding/EncodingUtils.cpp:28
(Diff revision 27)
>  bool
>  EncodingUtils::FindEncodingForLabel(const nsACString& aLabel,
>                                      nsACString& aOutEncoding)
>  {
> -  // Save aLabel first because it may refer the same string as aOutEncoding.
> -  nsCString label(aLabel);
> +  auto encoding = Encoding::ForLabel(aLabel);
> +  if (encoding) {

failure case first, please.

::: dom/encoding/EncodingUtils.cpp:41
(Diff revision 27)
>  bool
>  EncodingUtils::FindEncodingForLabelNoReplacement(const nsACString& aLabel,
>                                                   nsACString& aOutEncoding)
>  {
> -  if(!FindEncodingForLabel(aLabel, aOutEncoding)) {
> -    return false;
> +  auto encoding = Encoding::ForLabelNoReplacement(aLabel);
> +  if (encoding) {

Ditto.

::: dom/fetch/BodyExtractor.cpp:138
(Diff revision 27)
> -  if (NS_WARN_IF(NS_FAILED(rv))) {
> -    return rv;
> -  }
> -
>    nsCString encoded;
> -  if (!encoded.SetCapacity(destBufferLen, fallible)) {
> +  CopyUTF16toUTF8(*mBody, encoded);

Is mBody guaranteed to be valid?

::: dom/html/HTMLFormSubmission.h:126
(Diff revision 27)
>  
>  protected:
>    /**
>     * Can only be constructed by subclasses.
>     *
> -   * @param aCharset the charset of the form as a string
> +   * @param aEncdoing the encoding of the form

"the character encoding"? (because it is confusable with enctype in this context.)

::: dom/html/HTMLFormSubmission.h:137
(Diff revision 27)
>      , mOriginatingElement(aOriginatingElement)
>    {
>      MOZ_COUNT_CTOR(HTMLFormSubmission);
>    }
>  
> -  // The name of the encoder charset
> +  // The encoding of this form submission

ditto.

::: dom/html/HTMLFormSubmission.h:173
(Diff revision 27)
>   */
>  class FSMultipartFormData : public EncodingFormSubmission
>  {
>  public:
>    /**
> -   * @param aCharset the charset of the form as a string
> +   * @param aEncoding the encoding of the form

ditto.

::: dom/html/HTMLFormSubmission.cpp:93
(Diff revision 27)
>  
>  class FSURLEncoded : public EncodingFormSubmission
>  {
>  public:
>    /**
> -   * @param aCharset the charset of the form as a string
> +   * @param aEncoding the encoding of the form

ditto.

::: dom/html/nsHTMLDocument.cpp:722
(Diff revision 27)
>      TryTLD(charsetSource, charset);
>      TryFallback(charsetSource, charset);
>  
>      if (wyciwygChannel) {
>        // We know for sure that the parser needs to be using UTF16.
> -      parserCharset = "UTF-16";
> +      parserCharset = "UTF-16LE";

Is this valid for big-endian platforms (e.g. TenFourFox)?

::: dom/json/nsJSON.cpp:96
(Diff revision 27)
> -static const char UTF16BEBOM[] = "\xFE\xFF";
>  
>  static nsresult CheckCharset(const char* aCharset)
>  {
>    // Check that the charset is permissible
> -  if (!(strcmp(aCharset, "UTF-8") == 0 ||
> +  if (!(strcmp(aCharset, "UTF-8") == 0)) {

Is it possible to kill non-UTF-8 support before Firefox 57?

::: dom/plugins/base/nsPluginTags.cpp:450
(Diff revision 27)
>  
>  #if !defined(XP_WIN) && !defined(XP_MACOSX)
> -static nsresult ConvertToUTF8(nsIUnicodeDecoder *aUnicodeDecoder,
> -                              nsAFlatCString& aString)
> +static void
> +ConvertToUTF8(nsAFlatCString& aString)
>  {
> -  int32_t numberOfBytes = aString.Length();
> +  Unused << UTF_8_ENCODING->DecodeWithoutBOMHandling(aString, aString);

Are you sure Flash Player for Linux will never use non-UTF-8 strings? If so, what is this doing beyond a no-op?

::: dom/script/ScriptLoader.cpp:2465
(Diff revision 27)
> -    charset = "?";
> -  }
> +
> +  nsAutoCString charset;
> +  unicodeDecoder->Encoding()->Name(charset);
>    mozilla::Telemetry::Accumulate(mozilla::Telemetry::DOM_SCRIPT_SRC_ENCODING,
>      charset);
> -  return rv;
> +  return NS_OK;

Will this conversion never fail?

::: extensions/spellcheck/hunspell/glue/mozHunspell.cpp:578
(Diff revision 27)
> +        (*aSuggestions)[index] = (char16_t*)moz_xmalloc(needed.value());
> +        if (!((*aSuggestions)[index])) {
> -            rv = NS_ERROR_OUT_OF_MEMORY;
> +          rv = NS_ERROR_OUT_OF_MEMORY;
> +          break;

moz_xmalloc is infallible. If you intended to make this fallible, use a plain malloc. Otherwise, remove the redundant error handling.

::: intl/Encoding.h:162
(Diff revision 27)
> + * decode using the UTF-8 encoding, use the `UTF_8_ENCODING` `static`.
> + *
> + * If you don't know what encoding you need at compile time and need to
> + * dynamically get an encoding by label, use `Encoding::for_label()`.
> + *
> + * Instances of `Encoding` can be compared with `==`.

Does this guarantee make sense? (When C++ callers use a non-pointer Encoding instance?)

::: intl/Encoding.h:181
(Diff revision 27)
> +   * If, after ASCII-lowercasing and removing leading and trailing
> +   * whitespace, the argument matches a label defined in the Encoding
> +   * Standard, `const Encoding*` representing the corresponding
> +   * encoding is returned. If there is no match, `nullptr` is returned.
> +   */
> +  static inline const Encoding* ForLabel(Span<const char> aLabel)

Methods defined in class definition are inline by default. "inline" is redundant. (Same for most methods in this header.)

::: intl/Encoding.h:187
(Diff revision 27)
> +  /**
> +   * Implements the _get an encoding_ algorithm
> +   * (https://encoding.spec.whatwg.org/#concept-encoding-get).
> +   *
> +   * If, after ASCII-lowercasing and removing leading and trailing
> +   * whitespace, the argument matches a label defined in the Encoding
> +   * Standard, `const Encoding*` representing the corresponding
> +   * encoding is returned. If there is no match, `nullptr` is returned.

I don't think we need to repeat the same comment twice.

::: intl/Encoding.h:241
(Diff revision 27)
> +   *
> +   * Returns `MakeTuple(UTF_8_ENCODING, 3)`, `MakeTuple(UTF_16LE_ENCODING, 2)`
> +   * or `MakeTuple(UTF_16BE_ENCODING, 2)` if the argument starts with the
> +   * UTF-8, UTF-16LE or UTF-16BE BOM or `MakeTuple(nullptr, 0)` otherwise.
> +   */
> +  static inline Tuple<const Encoding*, size_t> ForBOM(

Could you add a newline before the method name? It is hard to find the method name when the return value type is long. (Same for all methods in this header.)

::: intl/Encoding.h:797
(Diff revision 27)
> +    return encoding_iso_2022_jp_ascii_valid_up_to(aBuffer.Elements(),
> +                                                  aBuffer.Length());
> +  }
> +
> +private:
> +  Encoding() = delete;

Please also make this class non-copyable (delete copy constructors and assignment operators).

::: intl/encoding_glue/src/lib.rs:1
(Diff revision 27)
> +// Copyright 2015-2016 Mozilla Foundation. See the COPYRIGHT

I didn't review the Rust code.

::: intl/uconv/nsConverterInputStream.cpp:239
(Diff revision 27)
> -      // bug 160784 again
> -      srcConsumed = std::max<uint32_t>(srcConsumed, 0);
> -      mConverter->Reset();
> -    }
> -    NS_ASSERTION(srcConsumed <= mByteData.Length(),
> -                 "Whoa.  The converter should have returned NS_OK_UDEC_MOREINPUT before this point!");
> +  mUnicharDataLength = written;
> +  if (result == kInputEmpty || result == kOutputFull) {
> +    *aErrorCode = NS_OK;
> +  } else {
> +    MOZ_ASSERT(mErrorsAreFatal, "How come DecodeToUTF16() reported error?");
> +    *aErrorCode = NS_ERROR_UDEC_ILLEGALINPUT;

Are callers (especially add-ons) ready to receive errors from this method?

::: netwerk/streamconv/converters/nsIndexedToHTML.cpp:503
(Diff revision 27)
>      if (NS_FAILED(rv)) return rv;
>      if (encoding.IsEmpty()) {
>        encoding.AssignLiteral("UTF-8");
>      }
>  
>      nsXPIDLString unEscapeSpec;

nsXPIDLString is no longer necessary.
Attachment #8862368 - Flags: review?(VYV03354) → review-
Comment on attachment 8862369 [details]
Bug 1261841 part 3 - Remove uconv files.

https://reviewboard.mozilla.org/r/134312/#review147734
Attachment #8862369 - Flags: review?(VYV03354) → review+
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review147696

> Is it possible to kill non-UTF-8 support before Firefox 57?

I checked a massive number of extensions (not all of them), and I didn't find a single one emitting UTF-16 JSON.
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review147696

> serializeToStream will not accept non-Encoding Standard labels anymore? Is it possible before Firefox 57?

It will accept any label but the output will use the Encoding Standard name, so you can still ask for `ISO-8859-1`, but the output will be labeled `windows-1252`. Due to the structure of the test case, it was easier to change the input label, too, in the test case.
(In reply to Henri Sivonen (:hsivonen) from comment #119)
> Comment on attachment 8862368 [details]
> Bug 1261841 part 2 - Use encoding_rs instead of uconv.
> 
> https://reviewboard.mozilla.org/r/134310/#review147696
> 
> > Is it possible to kill non-UTF-8 support before Firefox 57?
> 
> I checked a massive number of extensions (not all of them), and I didn't
> find a single one emitting UTF-16 JSON.

For nsIScriptableUnicodeConverter and nsIConverterOutputStream, I find extensions that have code for encoding to UTF-16 but I don't see them calling that code with arguments that would activate the UTF-16 path. (This check was less thorough than the nsIJSON check, though.)
Blocks: 1369018
Blocks: 1369020
Blocks: 1369022
Blocks: 1369032
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv. .

https://reviewboard.mozilla.org/r/134310/#review147696

> nit: please check failure instead of success as the old code did.

Changed.

> Is this an existing issue?

Filed bug 1369018.

> Why you could not remove this function? Is it a follow-up?

This patch is massive as-is, so I figured things like this would be follow-ups. Filed bug 1369020.

> ditto.

Filed bug 1369022.

> nit: failure case first.

Changed.

> nsXPIDLCString is no longer necessary because getter_Copies is no longer used.

Changed to `nsAutoCString`.

> Again, is it a follow-up to remove them?

Filed bug 1369025 and bug 1369032.

> failure case first, please.

Changed.

> Ditto.

Changed.

> Is mBody guaranteed to be valid?

The old code also dereferences the pointer unconditionally.

> "the character encoding"? (because it is confusable with enctype in this context.)

Changed.

> ditto.

Changed.

> ditto.

Changed.

> ditto.

Changed.

> Is this valid for big-endian platforms (e.g. TenFourFox)?

It appears that it is, because the data comes with a BOM: https://searchfox.org/mozilla-central/source/netwerk/protocol/wyciwyg/nsWyciwygChannel.cpp#457

> Are you sure Flash Player for Linux will never use non-UTF-8 strings? If so, what is this doing beyond a no-op?

According to bug 943276 comment 4, it seems that supporting non-UTF-8 is not worthwhile. However, in case non-UTF-8 somehow found its way here, this code would coerce bogus input into valid UTF-8 (with REPLACEMENT CHARACTERs).

> Will this conversion never fail?

It can't fail. The converter does not perform allocations and deals with bogus input.

> moz_xmalloc is infallible. If you intended to make this fallible, use a plain malloc. Otherwise, remove the redundant error handling.

I used what the old code used. Filed bug 1369647 to get this fixed as a follow-up by people more knowledgeable of the module's needs.

> Does this guarantee make sense? (When C++ callers use a non-pointer Encoding instance?)

Reworded the comment to talk about pointers, which is sure to make sense. (There isn't supposed to be a way to obtain a pointer by means other than getting a pointer to the static instances.)

> Methods defined in class definition are inline by default. "inline" is redundant. (Same for most methods in this header.)

I thought I saw a linker error somewhere before I added `inline`. Inline sure doesn't hurt here technically. I can test if things link on every platform without `inline` (in case my recollection is mistaken), but I think that makes sense to test as a separate changeset in order to be able to revert. Hence, not changing this at this time.

> I don't think we need to repeat the same comment twice.

Changed the copied comment to a note to read the other comment. Also expanded the other comment with detail.

> Could you add a newline before the method name? It is hard to find the method name when the return value type is long. (Same for all methods in this header.)

The formatting is "whatever `./mach clang-format` does". I don't have a strong opinion what the right formatting is, but I don't want to fight our tools. Therefore, if you feel strongly about the formatting being wrong here, I suggest filing a bug about `mach clang-format`.

> Please also make this class non-copyable (delete copy constructors and assignment operators).

Added. For `Decoder` and `Encoder`, too.

> I didn't review the Rust code.

Mystor already did.

> Are callers (especially add-ons) ready to receive errors from this method?

AFAICT, it was already possible for `*aErrorCode` to end up getting set to `NS_ERROR_UDEC_ILLEGALINPUT` if the user of the class asked for no replacement.

> nsXPIDLString is no longer necessary.

Changed to `nsAutoString`.
I'm going to make other changes to the patch queue before re-pushing, but I published the above comments now in order not to lose them by accident.
The logic claimed by the comments in nsScriptableUnicodeConverter::ConvertFromUnicode() was wrong, but, AFAICT, the result was right, so I changed the comments and added a release assert as a seat belt in case my reasoning is still wrong. (While at it, I updated encoding_rs to make Encoding::has_pending_state() public, because consumers of the library need it to correctly implement custom replacements of any length, and exposed mozilla::Encoding::HasPendingState(). I thought I'd need the latter for nsScriptableUnicodeConverter::ConvertFromUnicode(), but on second thought I didn't after all, because the replacement there is guaranteed to have a length of one.)
OK, this time the SIMD case (comment 132) actually used SIMD (on x86 and x86_64).

https://docs.google.com/spreadsheets/d/1BZcRK7wHhNq7zGYVreEn6mc-xvxzyjBG2mgQrEqZgWg/edit shows results obtained from https://hsivonen.com/test/moz/encoding_bench_web/ on 64-bit Windows using builds from the pushes of comment 130 and comment 132.

That ISO-2022-JP regresses perf a bit is known and, I think, acceptable. ISO-2022-JP is so rare that it isn't worth optimizing for speed and it's better to optimize for correctness and maintainability. (UTF-16 decode regresses perf, too, for the same reason, but I might fix that as a follow-up.)

When tested via TextDecoder/TextEncoder, the overhead of the bindings is significant. On the same computer with SIMD and with the same test data, I see 9.1 GB/s for English UTF-8 decode in a test harness without extra copying while this table shows 1.3 GB/s. For Japanese UTF-8 decode, it's 1.9 GB/s vs. 0.79 GB/s.

With encoding_rs (the "alu" and "simd" in the table), the js decode should be running the same code for all non-ISO-2022-JP encodings. The variation in perf seen in the table shows how much noise there is when testing within a browser. In that sense, the 64-bit ALU case performing worse than Chrome for Korean UTF-8 decode is well within the noise. Also, the uconv64 run for gb18030 js decode (marked gray) is clearly bogus.

The massive perf win over uconv euc-kr isn't bogus. uconv's euc-kr implementation really is that much slower.

Chrome's performance for UTF-8 and windows-1252 when the input is 100% ASCII looks like a specific optimization, since those two use WTF instead of ICU, and not like random chance.

(Language js means all-ASCII JavaScript. Minified jquery specifically.)

This testing suggests that the Talos regressions seen on the comment 130 and comment 132 pushes are bogus.
Oh, and natural-language decode is HTML (from Wikipedia) but natural-language encode is plain text. See https://github.com/hsivonen/encoding_bench/tree/master/src/wikipedia
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review147696

> I thought I saw a linker error somewhere before I added `inline`. Inline sure doesn't hurt here technically. I can test if things link on every platform without `inline` (in case my recollection is mistaken), but I think that makes sense to test as a separate changeset in order to be able to revert. Hence, not changing this at this time.

OK.
Comment on attachment 8862368 [details]
Bug 1261841 part 2 - Use encoding_rs instead of uconv.

https://reviewboard.mozilla.org/r/134310/#review150744

LGTM.
Attachment #8862368 - Flags: review?(VYV03354) → review+
(In reply to Masatoshi Kimura [:emk] from comment #136)
> LGTM.

Thank you!
Attachment #8872394 - Flags: review?(nfroyd)
Attachment #8872395 - Flags: review?(nfroyd)
Sigh. I just noticed that part 5 (comment 132) here makes Android on 32-bit ARM fail due to cargo locking issues. (Curiously, Android x86 and Aarch64 seem to be OK.)
(In reply to Henri Sivonen (:hsivonen) from comment #82)
>  * Make UTF-16 to UTF-8 encode of non-Latin content go faster on x86_64
> (probably by moving the surrogate presence check to SSE2).

I failed to do this without trading off ASCII performance. However, I managed to boost the ASCII case even further. I intend to land without trying to optimize non-ASCII UTF-16 to UTF-8 encode perf on x86_64 further.

>  * Make UTF-8 to UTF-16 decode and UTF-16 to UTF-8 encode go faster on ARMv7
> (probably by accelerating ASCII handling [both directions] and surrogate
> checks [encoder only] using NEON).

It seems that this had gotten fixed (how? dunno. maybe LLVM update in rustc. maybe bad thermal throttling when I last tested) since I last tested without explicit action on my part. Now even without NEON, encoding_rs outperforms uconv on Raspberry Pi 3 (ARMv8 core running ARMv7 code, Broadcom, GNU Linux), Nexus 5 (ARMv7, Qualcomm, Android) and Galaxy SII Plus (ARMv7, Broadcom, Android) in the cases I was concerned about.
(In reply to Henri Sivonen (:hsivonen) from comment #138)
> Sigh. I just noticed that part 5 (comment 132) here makes Android on 32-bit
> ARM fail due to cargo locking issues. (Curiously, Android x86 and Aarch64
> seem to be OK.)

Oh, it's a matter of missing RUSTFLAGS=' -C target-feature=+neon'.
Comment on attachment 8872395 [details]
Bug 1261841 part 5 - Enable explicit SIMD in Rust in automation infra.

https://reviewboard.mozilla.org/r/143900/#review152010

::: build/mozconfig.common:19
(Diff revision 2)
>  
>  ac_add_options --enable-crashreporter
>  
>  ac_add_options --enable-release
>  
> +ac_add_options --enable-rust-simd

Let's put this in build/mozconfig.rust instead.
Attachment #8872395 - Flags: review?(nfroyd) → review+
Comment on attachment 8872394 [details]
Bug 1261841 part 4 - Add a configuration option for enabling explicit SIMD in Rust.

https://reviewboard.mozilla.org/r/143898/#review152014

Please wait to land these patches until 54 is released next week.

::: toolkit/library/rust/Cargo.toml:14
(Diff revision 2)
>  bindgen = ["gkrust-shared/bindgen"]
>  servo = ["gkrust-shared/servo"]
>  quantum_render = ["gkrust-shared/quantum_render"]
>  cubeb_pulse_rust = ["gkrust-shared/cubeb_pulse_rust"]
>  gecko_debug = ["gkrust-shared/gecko_debug"]
> -# simd-accel = ["gkrust-shared/simd-accel"]
> +simd-accel = ["gkrust-shared/simd-accel"]

Sort of confusing to not have the corresponding gkrust-shared change in this commit...
Attachment #8872394 - Flags: review?(nfroyd) → review+
Comment on attachment 8872394 [details]
Bug 1261841 part 4 - Add a configuration option for enabling explicit SIMD in Rust.

https://reviewboard.mozilla.org/r/143898/#review152014

OK.

> Sort of confusing to not have the corresponding gkrust-shared change in this commit...

The uncommenting affects just the top-level crates. It doesn't seem worthwhile to move things around between the parts at this point.
Comment on attachment 8872395 [details]
Bug 1261841 part 5 - Enable explicit SIMD in Rust in automation infra.

https://reviewboard.mozilla.org/r/143900/#review152010

> Let's put this in build/mozconfig.rust instead.

Moved to mozconfig.rust.
Full try run with part 6:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd0381471271

Part 6 is needed to unbreak the Android ARMv7 build. (Actually using NEON there requires rustc changes. I already submitted one PR to that end, but it turned out another one is needed. Fortunately, encoding_rs outperforms uconv on ARMv7 even without NEON.)
(In reply to Simon Sapin (:SimonSapin) from comment #93)
> I’m happy to rubber-stamp encoding_rs for inclusion in m-c.

For clarity, was the above already rs=SimonSapin?
Flags: needinfo?(simon.sapin)
Blocks: 1372230
At this point, this is so close to landable that I'm going to stop updating part 1 and intend to defer the x86/x86_64 perf boost mentioned in comment 139 into a follow-up (bug 1372230).
Comment on attachment 8876730 [details]
Bug 1261841 part 6 - Make --enable-rust-simd a no-op on CPU architectures other than aarch64, x86 and x86_64.

https://reviewboard.mozilla.org/r/148066/#review152404

::: toolkit/moz.configure:805
(Diff revision 1)
> -    if value:
> +    if target.cpu in ('aarch64', 'x86', 'x86_64') and value:
>          return True

Can you add a comment here about why we're choosing only these architectures, particularly why we're not doing ARM?
Attachment #8876730 - Flags: review?(nfroyd) → review+
(In reply to Henri Sivonen (:hsivonen) from comment #153)
> (In reply to Simon Sapin (:SimonSapin) from comment #93)
> > I’m happy to rubber-stamp encoding_rs for inclusion in m-c.
> 
> For clarity, was the above already rs=SimonSapin?

I’m told rs means rubberstamp, so yes. (I haven’t looked at the patches in this bug, only at github.com/hsivonen/encoding_rs a few months ago.)
Flags: needinfo?(simon.sapin)
Comment on attachment 8876730 [details]
Bug 1261841 part 6 - Make --enable-rust-simd a no-op on CPU architectures other than aarch64, x86 and x86_64.

https://reviewboard.mozilla.org/r/148066/#review152404

> Can you add a comment here about why we're choosing only these architectures, particularly why we're not doing ARM?

Added comment.
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a799ece5b8ad
part 1 - Vendor encoding_rs and encoding_c into m-c. rs=emk,SimonSapin.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e155fa765af2
part 2 - Use encoding_rs instead of uconv. r=emk,mystor.
https://hg.mozilla.org/integration/mozilla-inbound/rev/4c8a154bf143
part 3 - Remove uconv files. r=emk.
https://hg.mozilla.org/integration/mozilla-inbound/rev/d87b44248f10
part 4 - Add a configuration option for enabling explicit SIMD in Rust. r=froydnj.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b72bac14767a
part 5 - Enable explicit SIMD in Rust in automation infra. r=froydnj.
https://hg.mozilla.org/integration/mozilla-inbound/rev/68d154644c0f
part 6 - Make --enable-rust-simd a no-op on CPU architectures other than aarch64, x86 and x86_64. r=froydnj.
Henri, what kind of testing/QA is this going to need? Beyond the linked gdoc, is there more documentation about the risks of this change?
Flags: needinfo?(hsivonen)
Also, what's the fallback plan here if we find serious regressions given that Part 3 removed uconv?
See Also: → 1373045
(In reply to Ryan VanderMeulen [:RyanVM] from comment #172)
> Henri, what kind of testing/QA is this going to need?
The Thunderbird team is busily testing this, encoding and decoding left, right and centre ;-)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #172)
> Henri, what kind of testing/QA is this going to need?

Usual Web usage, mostly, especially on sites that use legacy CJK encodings.

It would be great to have some adversarial unit test writing (unit tests written based on spec reading by someone other than the person who wrote the code). Sadly, it's not a thing we do. The code has had plenty of unit testing, though; see below.

> Beyond the linked gdoc, is there more documentation about the risks of this change?

No. This change is fairly low-risk despite the patch being huge.

 * encoding_rs implements the Encoding Standard, to the best of my knowledge, exactly. The spec has been developed carefully over several years to capture what's needed for Web compat. The known Web-visible behavior changes made here amount to adopting Chrome/Safari behavior for some things, so presumably those behaviors are Web-compatible enough for Chrome and Safari to be able to get away with them.

 * Except for the test changes made for the above-mentioned intentional behavior changes, the new code passed our pre-existing test suite.

 * The code has been tested with tests from the W3C i18n Core WG. (Last I checked, the tests that didn't pass were test suite bugs.)

 * encoding_rs, outside the Firefox context, runs a lot of unit tests, including tests that cover all the index entries for the CJK legacy encodings.

 * encoding_rs, outside the Firefox context, has been fuzzed with cargo-fuzz, which uses LLVM's coverage-guided fuzzer.

 * The two crasher bugs seen so far (bug 1372994 and bug 1373045) are instances of an existing class of bugs in Gecko that an added release assertion revealed and that the changes to the set of canonical names made here tickled more than usual. Those are not cause for any unusual alarm. (The real fix is going to be bug 919924, for which the changes made here are a prerequisite.)

 * One known risk is the add-on impact of the removal of the capability of encoding to UTF-16. See comment 121. I checked a very large number of AMO-hosted extensions using dxr, but not all of them. Adding the addon-compat keyword.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #173)
> Also, what's the fallback plan here if we find serious regressions given
> that Part 3 removed uconv?

Fixing the bugs that are found. A bunch of upcoming things need functionality added here, so reverting outright isn't an appropriate remedy for problems even if problems were found.
Flags: needinfo?(hsivonen)
Keywords: addon-compat
Depends on: 1373546
Should downstream add --enable-rust-simd to their mozconfigs for Firefox 56+? Or do you plan to implicitly enable it for --enable-release?
Flags: needinfo?(hsivonen)
(In reply to Jan Beich from comment #176)
> Should downstream add --enable-rust-simd to their mozconfigs for Firefox
> 56+?

For 32-bit x86, yes, if your builds require SSE2 anyway. In other cases: yes. (It's a no-op on ISAs where it's not supported. On 32-bit x86, it's not a no-op, but it also doesn't make sense if you are building for a non-SSE2 Rust target and C++ target generally.)

> Or do you plan to implicitly enable it for --enable-release?

I wasn't planning to, but maybe it would be good to do that.
Flags: needinfo?(hsivonen)
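(A hypothetical downstream mozconfig fragment reflecting the advice above; the option name is the one added in part 4, everything else is an assumption about a downstream build setup:)

```sh
# Hypothetical downstream mozconfig fragment. --enable-rust-simd is a
# no-op on unsupported ISAs; on 32-bit x86 it assumes an SSE2 baseline,
# so only add it there if the rest of the build targets SSE2 anyway.
ac_add_options --enable-rust-simd
```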
--enable-rust-simd also cannot be used with Rust built against non-default LLVM version.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223300