Closed Bug 1490601 Opened Last year Closed 3 months ago

Make SpiderMonkey-relevant parts of encoding_rs::mem available to SpiderMonkey

Categories

(Core :: Internationalization, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Currently conversions between UTF-16 and Latin1 using mozilla::Span (i.e. without  XPCOM string dependency) as well as the operation for asking if a mozilla::Span<const char16_t> is entirely in the Latin1 range are exposed via nsReadableUtils.h.

These should be exposed via MFBT in order to be usable in SpiderMonkey.
Blocks: 1490602
That doesn't sound like a good idea. That would mean moving encoding_rs to mozglue.
(In reply to Mike Hommey [:glandium] from comment #1)
> That doesn't sound like a good idea. That would mean moving encoding_rs to
> mozglue.

I take it that I've fundamentally misunderstood what SpiderMonkey links. I though that when built in the context of Firefox, SpiderMonkey was able to call into libxul and that there now was a gkrust replacement for standalone SpiderMonkey.

What's the arrangement actually like and why can't SpiderMonkey call into code in gkrust?
Flags: needinfo?(mh+mozilla)
Because there's no gkrust replacement for standalone Spidermonkey. There's a similar problem in bug 1469027. Also, afaik, building Firefox with spidermonkey as a shared library is still a supported setup, so spidermonkey can't call into libxul directly.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #3)
> Because there's no gkrust replacement for standalone Spidermonkey.

Oh, I thought we gained such a replacement a couple of weeks ago, but I guess not, as there's no Cargo.lock under js/.

> There's a
> similar problem in bug 1469027. Also, afaik, building Firefox with
> spidermonkey as a shared library is still a supported setup, so spidermonkey
> can't call into libxul directly.

I commented there.
Blocks: 1561567

(In reply to Mike Hommey [:glandium] from comment #3)

Because there's no gkrust replacement for standalone Spidermonkey.

My understanding is that this has now been fixed

Type: defect → enhancement

(In reply to Henri Sivonen (:hsivonen) from comment #5)

(In reply to Mike Hommey [:glandium] from comment #3)

Because there's no gkrust replacement for standalone Spidermonkey.

My understanding is that this has now been fixed

Is my understanding correct? That is, is it now OK to make both jsrust_shared and gkrust_shared transitively depend on encoding_rs so that a Firefox build ends up with one of encoding_rs? So bug 1490593 could introduce an encoding_c_mem crate depended on by jsrust_shared and wrapped for C++ by MFBT and both SpiderMonkey and the rest of Gecko could use those MFBT APIs?

Flags: needinfo?(mh+mozilla)

So bug 1490593 could introduce an encoding_c_mem crate depended on by jsrust_shared

Probably ; could be depended on by gkrust_shared too, presumably.

and wrapped for C++ by MFBT

No. MFBT is for templates. Not to expose things from rust, or other "heavyweight" things. Yes, there are some .cpp files in MFBT, but they'll eventually move somewhere else (bug 1554062). Mozglue is also not the right place, because while it's common to gecko and js, it's not statically linked to them, so you can't go from gecko C++ to gecko rust via mozglue. Presumably, you want to generate C headers for encoding_c_mem with cbindgen anyways.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #7)

So bug 1490593 could introduce an encoding_c_mem crate depended on by jsrust_shared

Probably ; could be depended on by gkrust_shared too, presumably.

Thanks.

and wrapped for C++ by MFBT

No. MFBT is for templates.

I don't see why we need to be so strict about the name matching the content.

Not to expose things from rust, or other "heavyweight" things.

Things like "Is this mozilla::Span<const char16_t> in the Latin1 range" and "Convert UTF-16 from this mozilla::Span<const char16_t> to this mozilla::Span<uint8_t>" seem pretty close in theme to the kind of stuff we have in Utf8.h and TextUtils.h. It seems odd that the implementation detail of being backed by Rust resulted in the entry point C++ headers going elsewhere.

Yes, there are some .cpp files in MFBT, but they'll eventually move somewhere else (bug 1554062). Mozglue is also not the right place, because while it's common to gecko and js, it's not statically linked to them, so you can't go from gecko C++ to gecko rust via mozglue.

Where should I put C++ headers that refer to MFBT types and that I want to make available to 1) stand-alone SpiderMonkey, 2) SpiderMonkey as part of Gecko, and 3) the rest of Gecko?

Presumably, you want to generate C headers for encoding_c_mem with cbindgen anyways.

I intend to provide a C header along the crate, but a mozilla::Span-aware C++ header seems like something that should live in m-c somewhere. At present, mbft/ seems like the most obvious place.

Flags: needinfo?(mh+mozilla)

I don't see why we need to be so strict about the name matching the content.

Because almost everything uses MFBT. Everything, including things that aren't in libxul. And aren't necessarily able to use libxul.

Where should I put C++ headers that refer to MFBT types and that I want to make available to 1) stand-alone SpiderMonkey, 2) SpiderMonkey as part of Gecko, and 3) the rest of Gecko?

Anywhere that makes sense and is traversed by the build system in all these configurations. Preferably somewhere close to the rust code. Actually, why not in the encoding_c_mem crate, behind a "gecko" or "gecko_extensions" feature?

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #9)

Preferably somewhere close to the rust code. Actually, why not in the encoding_c_mem crate, behind a "gecko" or "gecko_extensions" feature?

Seems weird to ship a Gecko-specific header on crates.io, but OK.

Component: MFBT → Internationalization
Summary: Make SpiderMonkey-relevant parts of encoding_rs::mem available via MFBT → Make SpiderMonkey-relevant parts of encoding_rs::mem available to SpiderMonkey
Depends on: 1572364
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

Note: This changeset does not yet make it possible to propagate the
simd-accel feature to encoding_rs in standalone SpiderMonkey builds.

(In reply to Henri Sivonen (:hsivonen) from comment #10)

(In reply to Mike Hommey [:glandium] from comment #9)

Preferably somewhere close to the rust code. Actually, why not in the encoding_c_mem crate, behind a "gecko" or "gecko_extensions" feature?

Seems weird to ship a Gecko-specific header on crates.io, but OK.

On second thought, I feel the need to push back on this. As I said on dev-platform, it's bad for discoverability if mozilla::IsAscii that takes '\0'-terminated const char* (currently in TextUtils.h) and mozilla::IsAscii that takes mozilla::Span<const char> (Rust-backed; currently in nsReadableUtils.h as IsASCII) aren't discoverable in the same place. I think which header these go into should depend on what kind of tasks the functions are meant for and not based on which function is .h-only, which is .cpp-backed, and which are Rust-backed.

So I think the C++ wrappers for the Rust FFI functions that this bug is about should go into mfbt/TextUtils.h. It doesn't seem like good use of time, though, if that approach would get r- (most of the churn that I don't want to perform only to get r- is changing the naming from Gecko case to MFBT case, i.e. s/ASCII/Ascii/, s/UTF16/Utf16/, etc.), so I feed the need to ask a preflight needinfo.

Putting the C++ wrappers behind

#if defined(MOZILLA_INTERNAL_API) || \
    (defined(MOZ_HAS_MOZGLUE) && defined(ENABLE_WASM_CRANELIFT))

in mfbt/TextUtils.h seems to work. (The ENABLE_WASM_CRANELIFT bit can go away once non-libxul SpiderMonkey starts building jsrust_shared unconditionally.) It will be not so nice-looking for non-SpiderMonkey, non-libxul users of MFBT, but I think we should optimize for the SpiderMonkey/Gecko-developers finding the SIMD-accelerated functions instead of risking developers reinventing the functions (likely without SIMD). (After all, non-SIMD versions of most of these functions take so little time to write that there's a particular risk of developers not searching exhaustively for existing faster implementations with the right semantics if they aren't offered in the obvious place.)

glandium, Waldo, does that seem like an approach I can expect r+ for?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jwalden)

I'll ask this: why do TextUtils.h and Utf8.h have to be in MFBT in the first place?

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #13)

I'll ask this: why do TextUtils.h and Utf8.h have to be in MFBT in the first place?

Waldo can answer why, but in the mean time: Where else could they be?

Flags: needinfo?(mh+mozilla)

The answer hasn't changed since comment 9.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #15)

The answer hasn't changed since comment 9.

Comment 9 suggests the Rust crate as the header location. Clearly, TextUtils.h and Utf8.h could not have originally have gone there. As for what to do now, it would be very impractical to have changes to TextUtils.h and Utf8.h travel through another repo plus crates.io.

Apart from "in" the crate directory, we don't have any place "close" to the Rust code.

As for creating a new non-vendored directory for infrastructural headers that are available to SpiderMonkey and Gecko, but not to crash reporter, software updater, etc., that would still pose the discoverability problem that SpiderMonkey and Gecko developers would have to look for functions based on the implementation approach of the functions rather than the task topic of the functions. It would be annoying if headers had to jump from mfbt/ to this new directory when the implementation approach changes. Also, #ifdefs under mfbt/ allow for things like Rust-backed fast implementations for SpiderMonkey and Gecko and naive C++ header-only implementations for other consumers for operations that can reasonably have a tiny naive implementation (e.g. "is ASCII"). It doesn't make sense to write header-only naive implementations for all of the functions just "for completeness", though, if there aren't use cases in crash reporter, etc.

You keep mentioning discoverability, but mfbt/ is not the single location where to find headers. In fact, when you see "mozilla/something" in a C++ file, it's very often not in mfbt/. Even js uses mozilla/ headers that are not in mfbt (few, though, compared to Gecko).

(In reply to Mike Hommey [:glandium] from comment #17)

You keep mentioning discoverability, but mfbt/ is not the single location where to find headers. In fact, when you see "mozilla/something" in a C++ file, it's very often not in mfbt/. Even js uses mozilla/ headers that are not in mfbt (few, though, compared to Gecko).

I'm not trying to solve "Which directory has mozilla/Foo.h?". What I want to solve is that when a SpiderMonkey or Gecko developer wants to check if a buffer is e.g. ASCII, they pick the SIMD-accelerated check instead of picking a non-accelerated check that the crash reporter would be eligible to use or writing their own non-accelerated version.

Concrete example: https://searchfox.org/mozilla-central/rev/3366c3d24f1c3818df37ec0818833bf085e41a53/netwerk/base/ProxyAutoConfig.cpp#725 is code that is allowed to use nsReadableUtils.h. Yet, it calls the non-SIMD-accelerated mozilla::IsValidUtf8 from Utf8.h instead of the SIMD-accelerated IsUTF8 from nsReadableUtils.h. We may need to have the two implementations (in case e.g. crash reporter wants to do the check without linking a bunch of Rust code), but I think it's bad that we have the two entry points, which makes it possible to pick the non-accelerated version for a caller that'd be eligible to call the accelerated version. (Maybe perf doesn't matter for proxy config specifically, but we should use the accelerated version when we have it.) I think it would be much better to have one entry point, in an obvious place (mfbt/ already has infrastructural code), and #ifdefs that pick the SIMD-accelerated implementation when we link the Rust code and pick the non-SIMD-accelerated C++ otherwise.

(And, again, I don't think we need to fill in the non-SIMD-accelerated C++ version ahead of time for every operation that I'm seeking to add here, but should fill the rest in only when non-Rust-linking users of mfbt/ need stuff.)

... and we're back to comment 13.

(In reply to Mike Hommey [:glandium] from comment #19)

... and we're back to comment 13.

So that e.g. crash reporter (or whatever in-tree code that doesn't link Rust) doesn't end up putting an "Is ASCII?" or "Is UTF-8?" function somewhere in the tree (likely under mfbt/!) where choice between that and whatever place TextUtils.h and Utf8.h would move to would be the same kind of trap as the duality of nsReadableUtils.h and mbft/ is now.

MFBT expands to a thing that says "templates", but that's a silly gimmick for an abbreviation. I've never understood it to be "just headers and inline stuff", and it's natural that it sometimes be .cpp files for functions that should only be defined once, out-of-line. (It might also sometime be nice to instantiate some highly-used templates in .cpp files too, e.g. Vector<void*> if we ever did the work to common up Vector instantiations with a casting veneer.)

Yeah, std::abs has multiple overloads declared in multiple headers and it's terrible. mozilla::IsAscii overloads of built-in types should all be declared in the same header. I'm not completely sanguine about making TextUtils.h depend on Span, but it's not the end of the world.

Putting IsUtf8 (however named) in Utf8.h seems correct to me, because it's clearly about UTF-8. Putting IsAscii (and IsNonAsciiLatin1) in TextUtils.h seems right to me because they are not directly about UTF-8.

My view of MFBT is that it's the one-stop shop for basic, general algorithms and utility functions and classes. I'd prefer if mozglue were just another consumer of it like JS and XPCOM and everything else is (those latter things indirectly through dependence on mozglue, if that's the easiest tack). But I try not to get too concerned about exactly how it all breaks down. I would also prefer if we didn't have multiple implementations of stuff in MFBT depending what the build context is, all else equal. But if it's a lot of work to make that happen, it's not critical. (Incidentally this is how IsValidUtf8 became its own thing, because I was just moving an implementation out of JS more or less wholesale, and having it adjacent to the UTF-8 code unit type was the sensible thing to do.)

Does that answer all the asked questions? Build-fu is really not my strong suit, and twenty comments of discussion leave me not 100% clear exactly what all the specific questions I need to answer are.

Flags: needinfo?(jwalden)

My view of MFBT is that it's the one-stop shop for basic, general algorithms and utility functions and classes.

Does that answer all the asked questions?

From my perspective, yes. Thank you. (Your answer matches my understanding of what MFBT is supposed to be.)

glandium, it seems to me that comment 21 also answers comment 13.

Flags: needinfo?(mh+mozilla)

It turns out that adding inline bool IsAscii(mozilla::Span<const char> aString); and inline bool IsAscii(mozilla::Span<const char16_t> aString); to the overload set of IsAscii such that passing an instance of nsACString or nsAString works due to autoconversions to Span in those classes is harder than I expected.

0:54.77 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/mozilla/TextUtils.h:85:13: error: no matching conversion for static_cast from 'nsTSubstring<char>' to 'UnsignedChar' (aka 'int')
 0:54.77   auto uc = static_cast<UnsignedChar>(aChar);
 0:54.77             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 0:54.77 /opt/Projects/gecko/intl/locale/MozLocale.cpp:22:29: note: in instantiation of function template specialization 'mozilla::IsAscii<nsTSubstring<char> >' requested here
 0:54.77   if (aLocale.IsEmpty() || !IsAscii(aLocale)) {
 0:54.77                             ^
 0:54.77 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:972:3: note: candidate template ignored: could not match 'mozilla::Span<unsigned char, 18446744073709551615>' against 'int'
 0:54.77   operator mozilla::Span<uint8_t>() {
 0:54.77   ^
 0:54.78 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:978:3: note: candidate template ignored: could not match 'mozilla::Span<const unsigned char, 18446744073709551615>' against 'int'
 0:54.78   operator mozilla::Span<const uint8_t>() const {
 0:54.78   ^
 0:54.78 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:947:3: note: candidate function
 0:54.78   operator mozilla::Span<char_type>() {
 0:54.78   ^
 0:54.78 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:951:3: note: candidate function
 0:54.78   operator mozilla::Span<const char_type>() const {
 0:54.78   ^
 0:54.79 In file included from /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/intl/locale/Unified_cpp_intl_locale0.cpp:20:
 0:54.79 /opt/Projects/gecko/intl/locale/MozLocale.cpp:22:37: error: calling a protected constructor of class 'nsTSubstring<char>'
 0:54.79   if (aLocale.IsEmpty() || !IsAscii(aLocale)) {
 0:54.79                                     ^
 0:54.79 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:1119:3: note: declared protected here
 0:54.79   nsTSubstring(const self_type& aStr)
 0:54.79   ^

This exceeds my template writing experience. How should I go about adding the span versions to the overload set such that passing XPCOM strings matches the span versions and not the integral version?

Flags: needinfo?(jwalden)

I think I'm just going to make IsAscii a non-template function and manually overload it for character types. (And adjust int32_t callers to use the char32_t overload instead.)

Flags: needinfo?(jwalden)

I commented out one static assert, because I don't understand why it no longer compiles when mozilla::IsAscii is an overloaded function rather than a template function.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=263577157c2c249a475859e76e2b1ff1ca29fe0a

Blocks: 1578339
Blocks: 1578650
Blocks: 1578677

(In reply to Henri Sivonen (:hsivonen) from comment #23)

This exceeds my template writing experience. How should I go about adding the span versions to the overload set such that passing XPCOM strings matches the span versions and not the integral version?

I'm sure it's work-through-able somehow, but yeah, defining all the character-type overloads separately is a fine alternative tack. (Probably better, even.)

(In reply to Henri Sivonen (:hsivonen) from comment #26)

I commented out one static assert, because I don't understand why it no longer compiles when mozilla::IsAscii is an overloaded function rather than a template function.

I think C++ doesn't quite understand how to figure out the overloading in that case, somehow. The auto and lambda tack I suggest should do the job. (You can't just use a lambda directly, because C++ with lambdas originally forbid them to appear in "unevaluated contexts" -- and MOZ_ASSERT includes a static assertion that acts upon the assert-expression to verify that the assert-expression isn't one of a few types that indicate a likely typo, e.g. accidentally passing in a function rather than a call of that function.)

(In reply to Jeff Walden [:Waldo] from comment #27)

I think C++ doesn't quite understand how to figure out the overloading in that case, somehow. The auto and lambda tack I suggest should do the job. (You can't just use a lambda directly, because C++ with lambdas originally forbid them to appear in "unevaluated contexts" -- and MOZ_ASSERT includes a static assertion that acts upon the assert-expression to verify that the assert-expression isn't one of a few types that indicate a likely typo, e.g. accidentally passing in a function rather than a call of that function.)

On the topic of asserting ASCIIness using lamba in that file. This failure is due to the input script having a lone non-ASCII byte that gets turned into a REPLACEMENT CHARACTER in a run of characters that the TokenStream wants to assert as ASCII:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=265926968&repo=try&lineNumber=1723

Which is a bit odd, since the patch didn't change how those bytes end up in TokenStream.

I need to debug this a bit more.

The bit == simple had gotten lost from the assertion in the patch.

Trying to get perfherder to show that I didn't regress cargo feature passing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b051131891b98d3b095ea7e40e3c9d2dd7cb0486

(In reply to Henri Sivonen (:hsivonen) from comment #31)

Trying to get perfherder to show that I didn't regress cargo feature passing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b051131891b98d3b095ea7e40e3c9d2dd7cb0486

Looks like the test results fluctuate a lot. This goes both ways in a manner that doesn't make any sense.

MFBT expands to a thing that says "templates", but that's a silly gimmick for an abbreviation. I've never understood it to be "just headers and inline stuff", and it's natural that it sometimes be .cpp files for functions that should only be defined once, out-of-line.

My view of MFBT is that it's the one-stop shop for basic, general algorithms and utility functions and classes. I'd prefer if mozglue were just another consumer of it like JS and XPCOM and everything else is (those latter things indirectly through dependence on mozglue, if that's the easiest tack).

Here's the problem: you can't ignore the reality of how things are built, and how they end up here and there. CPP files in MFBT end up in mozglue. IOW, they end up somewhere else than libxul. And that makes for a huge difference for multiple things:

  • exposing C++ APIs can actually be a PITA, where you need to find the right place where to put your MFBT_API. Sometimes on the class, sometimes on the method, sometimes somewhere else.
  • there's no occasion for the compiler to LTO-inline any of the code from MFBT's CPP files.
  • every new method or data exposed via a CPP file is a relocation at startup time.
  • everything that is in a MFBT CPP file and uses MOZ_CRASH/MOZ_ASSERT/MOZ_RELEASE_ASSERT, etc. doesn't print out a stack trace.

And that's not an exhaustive list.

And people don't realize the above limitations. Piling up more and more. At some point, that has to stop. I'm all for having a one-stop shop for basic, general algorithms and utility functions and classes, but MFBT is not it. Or if MFBT is it, it must only be used by mozjs and libxul. And nothing else. Not the firefox executable, not the updater executable, not the crashreporter executable, not the plugin container executable, not the gmp modules, etc.

One could think "oh, but we could statically link everything from MFBT CPP files to each of those"... except that would break everything that relies on globals that are unique to the process. Like Poisoning, Chaos mode, Crash annotations, and Record-replay. Sure, it's possible to accommodate for that, but part of doing that is to split MFBT, and it's not a one-stop shop anymore.

Flags: needinfo?(mh+mozilla)
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ccb280eb9023
part 1 - Move encoding_c and encoding_c_mem to jsrust_shared. r=glandium
https://hg.mozilla.org/integration/autoland/rev/c1a366edce9b
part 2 - Move C++ entry points to encoding_c_mem to mfbt/. r=jwalden

Considering the days between comment 22 and comment 33 and all the work I did in between, I went ahead and landed this despite comment 33.

I'm all for having a one-stop shop for basic, general algorithms and utility functions and classes, but MFBT is not it. Or if MFBT is it, it must only be used by mozjs and libxul.

It seems to me that the latter would be preferable if those are strictly the options.

And nothing else. Not the firefox executable, not the updater executable, not the crashreporter executable, not the plugin container executable, not the gmp modules, etc.

I think the scope of the issues is way beyond making the things available to SpiderMonkey that I made available to SpiderMonkey here. I think it would be useful to move this to dev-platform and to survey what parts of MFBT the firefox executable, the crash reported executable, the plugin container executable, and the GMP modules use.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/dd918656e4d7
Port bug 1490601 to comm-central: s/IsUTF8/IsUtf8/. rs=bustage-fix
Duplicate of this bug: 1396220
You need to log in before you can comment on or make changes to this bug.