Open Bug 497204 Opened 15 years ago Updated 2 years ago

Reorganize and consolidate (without rewriting, as much as possible) XPCOM string encoding and decoding code

Categories

(Core :: XPCOM, defect)

defect

Tracking

()

People

(Reporter: Waldo, Unassigned)

Details

(Keywords: sec-want, Whiteboard: [sg:want?])

Our string encoding and decoding APIs in XPCOM are a mess.  Encoding and decoding are spread across several different files and directories, and half a dozen or so people (including a few who would be expected to know the code and weren't just fixing bugs in passing) have had problems making sure fixes address all the relevant pieces of code.  We should consolidate all such code into one or two files in a single location so that the code is easier to understand, simpler to fix, easier to test, and less prone to bugs through incomplete fixes.  We should not be interpreting UTF-8 or UTF-16 data in XPCOM in more than one file per encoding, if at all possible.
Component: XPCOM → String
QA Contact: xpcom → string
Whiteboard: [sg:want?]
See also the UTF8 conversions in intl/uconv where there also some SSE and ARM optimizations that should be used here.
Also all the UTF* handling first does a calculate of the length walking all characters, and then does a conversion.
With all due respect, I didn't file this bug to optimize the code.  I filed it because the current code is scattered, disorganized, and duplicative; that it might not be hyper-efficient is, to me, a much smaller concern than its over-complexity or the security bugs that complexity has engendered.  Optimize after the code's cleaned up, or if that's not your cup of tea optimize the current code in separate bugs -- optimization is not a goal of this bug, except insofar as codesize reduction may have minor performance effects at the margin.
Filed bug 1402247 about rewriting the internals without reorganizing the API.
:njn would you consider this as having been accomplished with the recentish string reorg?
Flags: needinfo?(n.nethercote)
Hmm, not especially; I suspect there's plenty more that could be improved. erahm, what do you think?
Flags: needinfo?(n.nethercote) → needinfo?(erahm)
xpcom is still a bit scattered, a quick scan comes up with:

	xpcom/io/nsUnicharInputStream.cpp
	xpcom/string/nsReadableUtils.cpp (+SSE2 helper)
	xpcom/string/nsString.h
	xpcom/string/nsUTF8Utils.h (+SSE2/NEON helpers)

I'm not sure what the situation was 9 years ago, this might be good enough particularly w/ hsivonen's push for switching everything to encoding_rs.

And of course there's a fair amount of somewhat duplicated logic spread around our codebase, for example:

        encoding_rs
	ipc/glue/StringUtil.cpp
	gfx/cairo/cairo/src/cairo-unicode.c
	icu
	js/src/vm/CharacterEncoding.cpp
	security/sandbox/chromium/base/strings/utf_string_conversions.cc
	toolkit/crashreporter/google-breakpad/src/common/string_conversion.cc
	intl/Encoding.h <- looks like encoding_rs now
	intl/uconv <- ditto
	dom/base/nsTextFragment.h (SSE2 helper)

expat has some conversion stuff as well, but we thankfully don't use it. There also appear to be some JS implementations (TextEncoder as well as other ad hoc converters).
Flags: needinfo?(erahm)
Bug 1402247 is very much under active development. It retains APIs while swapping out the internals. In the interest of avoiding a lot of useless work and merge conflict churn, I'd appreciate it if work on this bug could be postponed until bug 1402247 has landed.

As for the original bug description here, bug 1402247 moves most UTF-8 and UTF-16 into encoding_rs, but leaves (rewritten) iteration by scalar value (UTF8CharEnumerator::NextChar() and UTF16CharEnumerator::NextChar()) in C++ to allow them to be inlined into C++ callers. (Callers for UTF8CharEnumerator::NextChar() are so few that inlining seems OK and allows for the removal of duplicate ASCII optimizations from the callers. UTF16CharEnumerator::NextChar() can be rewritten in such a simple way that inlining it becomes OK.)

(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #6)
> xpcom is still a bit scattered, a quick scan comes up with:
> 
> 	xpcom/io/nsUnicharInputStream.cpp

Going away in bug 1445692.

>     	xpcom/string/nsReadableUtils.cpp (+SSE2 helper)
> 	xpcom/string/nsString.h
> 	xpcom/string/nsUTF8Utils.h (+SSE2/NEON helpers)

On track to moving to encoding_rs.

>   	ipc/glue/StringUtil.cpp
> 	security/sandbox/chromium/base/strings/utf_string_conversions.cc

These are sad and arise from the use of std::wstring, which isn't portable. I wish we migrated the code either to XPCOM strings or at least to portable std:: alternatives (either std::string holding UTF-8 on all platforms or std::u16string holding UTF-16 on all platforms).

> 	gfx/cairo/cairo/src/cairo-unicode.c
> 	icu

These probably need to stay the way they are for easy sync with the upstream.

> 	js/src/vm/CharacterEncoding.cpp

I'd like to move this over to encoding_rs but it's pending on SpiderMonkey getting comfortable with Rust dependencies.

> 	toolkit/crashreporter/google-breakpad/src/common/string_conversion.cc

This probably needs to stay separate in order to enable easy sync with the upstream. (It looks like this is a copy of old ICU code that was replaced in the latest ICU in response to https://hsivonen.fi/broken-utf-8/ .)

> 	dom/base/nsTextFragment.h (SSE2 helper)

On track to moving to encoding_rs.

> expat has some conversion stuff as well, but we thankfully don't use it.
> There also appear to be some JS implementations (TextEncoder as well as
> other ad hoc converters).

TextEncoder and TextDecoder are wrappers for encoding_rs.
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> These are sad and arise from the use of std::wstring, which isn't portable.
> I wish we migrated the code either to XPCOM strings or at least to portable
> std:: alternatives (either std::string holding UTF-8 on all platforms or
> std::u16string holding UTF-16 on all platforms).

Filed bug 1447674.
Component: String → XPCOM
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.