Make SpiderMonkey-relevant parts of encoding_rs::mem available to SpiderMonkey
Categories
(Core :: Internationalization, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox71 | --- | fixed |
People
(Reporter: hsivonen, Assigned: hsivonen)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
Currently conversions between UTF-16 and Latin1 using mozilla::Span (i.e. without XPCOM string dependency) as well as the operation for asking if a mozilla::Span<const char16_t> is entirely in the Latin1 range are exposed via nsReadableUtils.h. These should be exposed via MFBT in order to be usable in SpiderMonkey.
Comment 1•6 years ago
|
||
That doesn't sound like a good idea. That would mean moving encoding_rs to mozglue.
Assignee | ||
Comment 2•6 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #1) > That doesn't sound like a good idea. That would mean moving encoding_rs to > mozglue. I take it that I've fundamentally misunderstood what SpiderMonkey links. I though that when built in the context of Firefox, SpiderMonkey was able to call into libxul and that there now was a gkrust replacement for standalone SpiderMonkey. What's the arrangement actually like and why can't SpiderMonkey call into code in gkrust?
Comment 3•6 years ago
|
||
Because there's no gkrust replacement for standalone Spidermonkey. There's a similar problem in bug 1469027. Also, afaik, building Firefox with spidermonkey as a shared library is still a supported setup, so spidermonkey can't call into libxul directly.
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #3) > Because there's no gkrust replacement for standalone Spidermonkey. Oh, I thought we gained such a replacement a couple of weeks ago, but I guess not, as there's no Cargo.lock under js/. > There's a > similar problem in bug 1469027. Also, afaik, building Firefox with > spidermonkey as a shared library is still a supported setup, so spidermonkey > can't call into libxul directly. I commented there.
Assignee | ||
Comment 5•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #3)
Because there's no gkrust replacement for standalone Spidermonkey.
My understanding is that this has now been fixed
Assignee | ||
Comment 6•5 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #5)
(In reply to Mike Hommey [:glandium] from comment #3)
Because there's no gkrust replacement for standalone Spidermonkey.
My understanding is that this has now been fixed
Is my understanding correct? That is, is it now OK to make both jsrust_shared
and gkrust_shared
transitively depend on encoding_rs
so that a Firefox build ends up with one of encoding_rs
? So bug 1490593 could introduce an encoding_c_mem
crate depended on by jsrust_shared
and wrapped for C++ by MFBT and both SpiderMonkey and the rest of Gecko could use those MFBT APIs?
Comment 7•5 years ago
|
||
So bug 1490593 could introduce an encoding_c_mem crate depended on by jsrust_shared
Probably ; could be depended on by gkrust_shared too, presumably.
and wrapped for C++ by MFBT
No. MFBT is for templates. Not to expose things from rust, or other "heavyweight" things. Yes, there are some .cpp files in MFBT, but they'll eventually move somewhere else (bug 1554062). Mozglue is also not the right place, because while it's common to gecko and js, it's not statically linked to them, so you can't go from gecko C++ to gecko rust via mozglue. Presumably, you want to generate C headers for encoding_c_mem with cbindgen anyways.
Assignee | ||
Comment 8•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #7)
So bug 1490593 could introduce an encoding_c_mem crate depended on by jsrust_shared
Probably ; could be depended on by gkrust_shared too, presumably.
Thanks.
and wrapped for C++ by MFBT
No. MFBT is for templates.
I don't see why we need to be so strict about the name matching the content.
Not to expose things from rust, or other "heavyweight" things.
Things like "Is this mozilla::Span<const char16_t>
in the Latin1 range" and "Convert UTF-16 from this mozilla::Span<const char16_t>
to this mozilla::Span<uint8_t>
" seem pretty close in theme to the kind of stuff we have in Utf8.h
and TextUtils.h
. It seems odd that the implementation detail of being backed by Rust resulted in the entry point C++ headers going elsewhere.
Yes, there are some .cpp files in MFBT, but they'll eventually move somewhere else (bug 1554062). Mozglue is also not the right place, because while it's common to gecko and js, it's not statically linked to them, so you can't go from gecko C++ to gecko rust via mozglue.
Where should I put C++ headers that refer to MFBT types and that I want to make available to 1) stand-alone SpiderMonkey, 2) SpiderMonkey as part of Gecko, and 3) the rest of Gecko?
Presumably, you want to generate C headers for encoding_c_mem with cbindgen anyways.
I intend to provide a C header along the crate, but a mozilla::Span
-aware C++ header seems like something that should live in m-c somewhere. At present, mbft/
seems like the most obvious place.
Comment 9•5 years ago
|
||
I don't see why we need to be so strict about the name matching the content.
Because almost everything uses MFBT. Everything, including things that aren't in libxul. And aren't necessarily able to use libxul.
Where should I put C++ headers that refer to MFBT types and that I want to make available to 1) stand-alone SpiderMonkey, 2) SpiderMonkey as part of Gecko, and 3) the rest of Gecko?
Anywhere that makes sense and is traversed by the build system in all these configurations. Preferably somewhere close to the rust code. Actually, why not in the encoding_c_mem crate, behind a "gecko" or "gecko_extensions" feature?
Assignee | ||
Comment 10•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #9)
Preferably somewhere close to the rust code. Actually, why not in the encoding_c_mem crate, behind a "gecko" or "gecko_extensions" feature?
Seems weird to ship a Gecko-specific header on crates.io, but OK.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 11•5 years ago
|
||
Note: This changeset does not yet make it possible to propagate the
simd-accel feature to encoding_rs in standalone SpiderMonkey builds.
Assignee | ||
Comment 12•5 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #10)
(In reply to Mike Hommey [:glandium] from comment #9)
Preferably somewhere close to the rust code. Actually, why not in the encoding_c_mem crate, behind a "gecko" or "gecko_extensions" feature?
Seems weird to ship a Gecko-specific header on crates.io, but OK.
On second thought, I feel the need to push back on this. As I said on dev-platform, it's bad for discoverability if mozilla::IsAscii
that takes '\0'
-terminated const char*
(currently in TextUtils.h
) and mozilla::IsAscii
that takes mozilla::Span<const char>
(Rust-backed; currently in nsReadableUtils.h
as IsASCII
) aren't discoverable in the same place. I think which header these go into should depend on what kind of tasks the functions are meant for and not based on which function is .h
-only, which is .cpp
-backed, and which are Rust-backed.
So I think the C++ wrappers for the Rust FFI functions that this bug is about should go into mfbt/TextUtils.h
. It doesn't seem like good use of time, though, if that approach would get r- (most of the churn that I don't want to perform only to get r- is changing the naming from Gecko case to MFBT case, i.e. s/ASCII/Ascii/, s/UTF16/Utf16/, etc.), so I feed the need to ask a preflight needinfo.
Putting the C++ wrappers behind
#if defined(MOZILLA_INTERNAL_API) || \
(defined(MOZ_HAS_MOZGLUE) && defined(ENABLE_WASM_CRANELIFT))
in mfbt/TextUtils.h
seems to work. (The ENABLE_WASM_CRANELIFT
bit can go away once non-libxul SpiderMonkey starts building jsrust_shared
unconditionally.) It will be not so nice-looking for non-SpiderMonkey, non-libxul users of MFBT, but I think we should optimize for the SpiderMonkey/Gecko-developers finding the SIMD-accelerated functions instead of risking developers reinventing the functions (likely without SIMD). (After all, non-SIMD versions of most of these functions take so little time to write that there's a particular risk of developers not searching exhaustively for existing faster implementations with the right semantics if they aren't offered in the obvious place.)
glandium, Waldo, does that seem like an approach I can expect r+ for?
Comment 13•5 years ago
|
||
I'll ask this: why do TextUtils.h and Utf8.h have to be in MFBT in the first place?
Assignee | ||
Comment 14•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #13)
I'll ask this: why do TextUtils.h and Utf8.h have to be in MFBT in the first place?
Waldo can answer why, but in the mean time: Where else could they be?
Assignee | ||
Comment 16•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #15)
The answer hasn't changed since comment 9.
Comment 9 suggests the Rust crate as the header location. Clearly, TextUtils.h
and Utf8.h
could not have originally have gone there. As for what to do now, it would be very impractical to have changes to TextUtils.h
and Utf8.h
travel through another repo plus crates.io.
Apart from "in" the crate directory, we don't have any place "close" to the Rust code.
As for creating a new non-vendored directory for infrastructural headers that are available to SpiderMonkey and Gecko, but not to crash reporter, software updater, etc., that would still pose the discoverability problem that SpiderMonkey and Gecko developers would have to look for functions based on the implementation approach of the functions rather than the task topic of the functions. It would be annoying if headers had to jump from mfbt/
to this new directory when the implementation approach changes. Also, #ifdef
s under mfbt/
allow for things like Rust-backed fast implementations for SpiderMonkey and Gecko and naive C++ header-only implementations for other consumers for operations that can reasonably have a tiny naive implementation (e.g. "is ASCII"). It doesn't make sense to write header-only naive implementations for all of the functions just "for completeness", though, if there aren't use cases in crash reporter, etc.
Comment 17•5 years ago
|
||
You keep mentioning discoverability, but mfbt/ is not the single location where to find headers. In fact, when you see "mozilla/something" in a C++ file, it's very often not in mfbt/. Even js uses mozilla/ headers that are not in mfbt (few, though, compared to Gecko).
Assignee | ||
Comment 18•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #17)
You keep mentioning discoverability, but mfbt/ is not the single location where to find headers. In fact, when you see "mozilla/something" in a C++ file, it's very often not in mfbt/. Even js uses mozilla/ headers that are not in mfbt (few, though, compared to Gecko).
I'm not trying to solve "Which directory has mozilla/Foo.h
?". What I want to solve is that when a SpiderMonkey or Gecko developer wants to check if a buffer is e.g. ASCII, they pick the SIMD-accelerated check instead of picking a non-accelerated check that the crash reporter would be eligible to use or writing their own non-accelerated version.
Concrete example: https://searchfox.org/mozilla-central/rev/3366c3d24f1c3818df37ec0818833bf085e41a53/netwerk/base/ProxyAutoConfig.cpp#725 is code that is allowed to use nsReadableUtils.h
. Yet, it calls the non-SIMD-accelerated mozilla::IsValidUtf8
from Utf8.h
instead of the SIMD-accelerated IsUTF8
from nsReadableUtils.h
. We may need to have the two implementations (in case e.g. crash reporter wants to do the check without linking a bunch of Rust code), but I think it's bad that we have the two entry points, which makes it possible to pick the non-accelerated version for a caller that'd be eligible to call the accelerated version. (Maybe perf doesn't matter for proxy config specifically, but we should use the accelerated version when we have it.) I think it would be much better to have one entry point, in an obvious place (mfbt/
already has infrastructural code), and #ifdef
s that pick the SIMD-accelerated implementation when we link the Rust code and pick the non-SIMD-accelerated C++ otherwise.
(And, again, I don't think we need to fill in the non-SIMD-accelerated C++ version ahead of time for every operation that I'm seeking to add here, but should fill the rest in only when non-Rust-linking users of mfbt/
need stuff.)
Comment 19•5 years ago
|
||
... and we're back to comment 13.
Assignee | ||
Comment 20•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #19)
... and we're back to comment 13.
So that e.g. crash reporter (or whatever in-tree code that doesn't link Rust) doesn't end up putting an "Is ASCII?" or "Is UTF-8?" function somewhere in the tree (likely under mfbt/
!) where choice between that and whatever place TextUtils.h
and Utf8.h
would move to would be the same kind of trap as the duality of nsReadableUtils.h
and mbft/
is now.
Comment 21•5 years ago
|
||
MFBT expands to a thing that says "templates", but that's a silly gimmick for an abbreviation. I've never understood it to be "just headers and inline stuff", and it's natural that it sometimes be .cpp
files for functions that should only be defined once, out-of-line. (It might also sometime be nice to instantiate some highly-used templates in .cpp
files too, e.g. Vector<void*>
if we ever did the work to common up Vector
instantiations with a casting veneer.)
Yeah, std::abs
has multiple overloads declared in multiple headers and it's terrible. mozilla::IsAscii
overloads of built-in types should all be declared in the same header. I'm not completely sanguine about making TextUtils.h
depend on Span
, but it's not the end of the world.
Putting IsUtf8
(however named) in Utf8.h
seems correct to me, because it's clearly about UTF-8. Putting IsAscii
(and IsNonAsciiLatin1
) in TextUtils.h
seems right to me because they are not directly about UTF-8.
My view of MFBT is that it's the one-stop shop for basic, general algorithms and utility functions and classes. I'd prefer if mozglue were just another consumer of it like JS and XPCOM and everything else is (those latter things indirectly through dependence on mozglue, if that's the easiest tack). But I try not to get too concerned about exactly how it all breaks down. I would also prefer if we didn't have multiple implementations of stuff in MFBT depending what the build context is, all else equal. But if it's a lot of work to make that happen, it's not critical. (Incidentally this is how IsValidUtf8
became its own thing, because I was just moving an implementation out of JS more or less wholesale, and having it adjacent to the UTF-8 code unit type was the sensible thing to do.)
Does that answer all the asked questions? Build-fu is really not my strong suit, and twenty comments of discussion leave me not 100% clear exactly what all the specific questions I need to answer are.
Assignee | ||
Comment 22•5 years ago
|
||
My view of MFBT is that it's the one-stop shop for basic, general algorithms and utility functions and classes.
Does that answer all the asked questions?
From my perspective, yes. Thank you. (Your answer matches my understanding of what MFBT is supposed to be.)
glandium, it seems to me that comment 21 also answers comment 13.
Assignee | ||
Comment 23•5 years ago
|
||
It turns out that adding inline bool IsAscii(mozilla::Span<const char> aString);
and inline bool IsAscii(mozilla::Span<const char16_t> aString);
to the overload set of IsAscii
such that passing an instance of nsACString
or nsAString
works due to autoconversions to Span
in those classes is harder than I expected.
0:54.77 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/mozilla/TextUtils.h:85:13: error: no matching conversion for static_cast from 'nsTSubstring<char>' to 'UnsignedChar' (aka 'int')
0:54.77 auto uc = static_cast<UnsignedChar>(aChar);
0:54.77 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0:54.77 /opt/Projects/gecko/intl/locale/MozLocale.cpp:22:29: note: in instantiation of function template specialization 'mozilla::IsAscii<nsTSubstring<char> >' requested here
0:54.77 if (aLocale.IsEmpty() || !IsAscii(aLocale)) {
0:54.77 ^
0:54.77 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:972:3: note: candidate template ignored: could not match 'mozilla::Span<unsigned char, 18446744073709551615>' against 'int'
0:54.77 operator mozilla::Span<uint8_t>() {
0:54.77 ^
0:54.78 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:978:3: note: candidate template ignored: could not match 'mozilla::Span<const unsigned char, 18446744073709551615>' against 'int'
0:54.78 operator mozilla::Span<const uint8_t>() const {
0:54.78 ^
0:54.78 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:947:3: note: candidate function
0:54.78 operator mozilla::Span<char_type>() {
0:54.78 ^
0:54.78 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:951:3: note: candidate function
0:54.78 operator mozilla::Span<const char_type>() const {
0:54.78 ^
0:54.79 In file included from /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/intl/locale/Unified_cpp_intl_locale0.cpp:20:
0:54.79 /opt/Projects/gecko/intl/locale/MozLocale.cpp:22:37: error: calling a protected constructor of class 'nsTSubstring<char>'
0:54.79 if (aLocale.IsEmpty() || !IsAscii(aLocale)) {
0:54.79 ^
0:54.79 /opt/Projects/gecko/obj-x86_64-pc-linux-gnu/dist/include/nsTSubstring.h:1119:3: note: declared protected here
0:54.79 nsTSubstring(const self_type& aStr)
0:54.79 ^
This exceeds my template writing experience. How should I go about adding the span versions to the overload set such that passing XPCOM strings matches the span versions and not the integral version?
Assignee | ||
Comment 24•5 years ago
|
||
I think I'm just going to make IsAscii
a non-template function and manually overload it for character types. (And adjust int32_t callers to use the char32_t overload instead.)
Assignee | ||
Comment 25•5 years ago
|
||
Assignee | ||
Comment 26•5 years ago
|
||
I commented out one static assert, because I don't understand why it no longer compiles when mozilla::IsAscii
is an overloaded function rather than a template function.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=263577157c2c249a475859e76e2b1ff1ca29fe0a
Comment 27•5 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #23)
This exceeds my template writing experience. How should I go about adding the span versions to the overload set such that passing XPCOM strings matches the span versions and not the integral version?
I'm sure it's work-through-able somehow, but yeah, defining all the character-type overloads separately is a fine alternative tack. (Probably better, even.)
(In reply to Henri Sivonen (:hsivonen) from comment #26)
I commented out one static assert, because I don't understand why it no longer compiles when
mozilla::IsAscii
is an overloaded function rather than a template function.
I think C++ doesn't quite understand how to figure out the overloading in that case, somehow. The auto
and lambda tack I suggest should do the job. (You can't just use a lambda directly, because C++ with lambdas originally forbid them to appear in "unevaluated contexts" -- and MOZ_ASSERT
includes a static assertion that acts upon the assert-expression to verify that the assert-expression isn't one of a few types that indicate a likely typo, e.g. accidentally passing in a function rather than a call of that function.)
Assignee | ||
Comment 28•5 years ago
|
||
(In reply to Jeff Walden [:Waldo] from comment #27)
I think C++ doesn't quite understand how to figure out the overloading in that case, somehow. The
auto
and lambda tack I suggest should do the job. (You can't just use a lambda directly, because C++ with lambdas originally forbid them to appear in "unevaluated contexts" -- andMOZ_ASSERT
includes a static assertion that acts upon the assert-expression to verify that the assert-expression isn't one of a few types that indicate a likely typo, e.g. accidentally passing in a function rather than a call of that function.)
On the topic of asserting ASCIIness using lamba in that file. This failure is due to the input script having a lone non-ASCII byte that gets turned into a REPLACEMENT CHARACTER in a run of characters that the TokenStream
wants to assert as ASCII:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=265926968&repo=try&lineNumber=1723
Which is a bit odd, since the patch didn't change how those bytes end up in TokenStream
.
I need to debug this a bit more.
Assignee | ||
Comment 29•5 years ago
|
||
The bit == simple
had gotten lost from the assertion in the patch.
Assignee | ||
Comment 30•5 years ago
|
||
Try run with rebase and fixed assertion:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7264e180bad0638a642cf3d23c9a5523a742931
Assignee | ||
Comment 31•5 years ago
|
||
Trying to get perfherder to show that I didn't regress cargo feature passing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b051131891b98d3b095ea7e40e3c9d2dd7cb0486
Assignee | ||
Comment 32•5 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #31)
Trying to get perfherder to show that I didn't regress cargo feature passing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b051131891b98d3b095ea7e40e3c9d2dd7cb0486
Looks like the test results fluctuate a lot. This goes both ways in a manner that doesn't make any sense.
Comment 33•5 years ago
|
||
MFBT expands to a thing that says "templates", but that's a silly gimmick for an abbreviation. I've never understood it to be "just headers and inline stuff", and it's natural that it sometimes be .cpp files for functions that should only be defined once, out-of-line.
My view of MFBT is that it's the one-stop shop for basic, general algorithms and utility functions and classes. I'd prefer if mozglue were just another consumer of it like JS and XPCOM and everything else is (those latter things indirectly through dependence on mozglue, if that's the easiest tack).
Here's the problem: you can't ignore the reality of how things are built, and how they end up here and there. CPP files in MFBT end up in mozglue. IOW, they end up somewhere else than libxul. And that makes for a huge difference for multiple things:
- exposing C++ APIs can actually be a PITA, where you need to find the right place where to put your MFBT_API. Sometimes on the class, sometimes on the method, sometimes somewhere else.
- there's no occasion for the compiler to LTO-inline any of the code from MFBT's CPP files.
- every new method or data exposed via a CPP file is a relocation at startup time.
- everything that is in a MFBT CPP file and uses MOZ_CRASH/MOZ_ASSERT/MOZ_RELEASE_ASSERT, etc. doesn't print out a stack trace.
And that's not an exhaustive list.
And people don't realize the above limitations. Piling up more and more. At some point, that has to stop. I'm all for having a one-stop shop for basic, general algorithms and utility functions and classes, but MFBT is not it. Or if MFBT is it, it must only be used by mozjs and libxul. And nothing else. Not the firefox executable, not the updater executable, not the crashreporter executable, not the plugin container executable, not the gmp modules, etc.
One could think "oh, but we could statically link everything from MFBT CPP files to each of those"... except that would break everything that relies on globals that are unique to the process. Like Poisoning, Chaos mode, Crash annotations, and Record-replay. Sure, it's possible to accommodate for that, but part of doing that is to split MFBT, and it's not a one-stop shop anymore.
Comment 34•5 years ago
|
||
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ccb280eb9023 part 1 - Move encoding_c and encoding_c_mem to jsrust_shared. r=glandium https://hg.mozilla.org/integration/autoland/rev/c1a366edce9b part 2 - Move C++ entry points to encoding_c_mem to mfbt/. r=jwalden
Assignee | ||
Comment 35•5 years ago
|
||
Considering the days between comment 22 and comment 33 and all the work I did in between, I went ahead and landed this despite comment 33.
I'm all for having a one-stop shop for basic, general algorithms and utility functions and classes, but MFBT is not it. Or if MFBT is it, it must only be used by mozjs and libxul.
It seems to me that the latter would be preferable if those are strictly the options.
And nothing else. Not the firefox executable, not the updater executable, not the crashreporter executable, not the plugin container executable, not the gmp modules, etc.
I think the scope of the issues is way beyond making the things available to SpiderMonkey that I made available to SpiderMonkey here. I think it would be useful to move this to dev-platform and to survey what parts of MFBT the firefox executable, the crash reported executable, the plugin container executable, and the GMP modules use.
Comment 36•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/ccb280eb9023
https://hg.mozilla.org/mozilla-central/rev/c1a366edce9b
Comment 37•5 years ago
|
||
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/dd918656e4d7 Port bug 1490601 to comm-central: s/IsUTF8/IsUtf8/. rs=bustage-fix
Description
•