Closed Bug 541245 Opened 10 years ago Closed 10 years ago

UTF8 <-> UTF16 conversion/comparison code is buggy/inconsistent

Categories

(Core :: String, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla1.9.3a2

People

(Reporter: sicking, Assigned: sicking)

References

(Depends on 1 open bug)

Details

Attachments

(5 files, 1 obsolete file)

There's a few problems with the current UTF8/UTF16 conversion/comparison code.

* Converting UTF8->UTF16 generally converts overlong UTF8 sequences to UTF16 replacement characters. This is good to avoid security issues by 'hiding' characters in an overlong encoded sequence. However CompareUTF8toUTF16 does not do the same thing.

* There's a bug in the UTF8->UTF16 conversion code that makes us measure the length of the resulting UTF8 code if a 4-byte UTF8 sequence is overlong. This results in us truncating the result.

* The UTF8->UTF16 conversion code does error checking by checking that the measured length is the same as the actual conversion length. It's unclear that this catches all errors.

* There's no tests


Patches coming up
This fixes the bug in the length calculation. It also simplifies error checking by checking the existing error flag rather than comparing lengths.

Also removes some left-over code from when we had fragmented strings. Really we don't need to use character sinks at all since we don't have fragmented strings, but I'd rather do that separately.
Assignee: nobody → jonas
Attachment #422868 - Flags: review?(dbaron)
Move replacement character handling into the UTF8 enumerator so that it happens in the conversion and comparison code likewise.

Also remove some dead UTF8 enumerator code.
Attachment #422872 - Flags: review?(dbaron)
Something went wrong in the diffing. This is the full patch.
Attachment #422872 - Attachment is obsolete: true
Attachment #422893 - Flags: review?(dbaron)
Attachment #422872 - Flags: review?(dbaron)
Comment on attachment 422868 [details] [diff] [review]
Part1: Fix UTF8 -> UTF16 conversion code

Simon, would you be able to review these?
Attachment #422868 - Flags: review?(dbaron) → review?(smontagu)
Attachment #422873 - Flags: review?(dbaron) → review?(smontagu)
Attachment #422893 - Flags: review?(dbaron) → review?(smontagu)
I don't think I can review these, because I'm not an XPCOM peer.
I'm more than happy to delegate the reviews here.
Just to be clear (sorry, ESL), you're happy to let Simon review? Or to review yourself?
To let smontagu review.
Comment on attachment 422868 [details] [diff] [review]
Part1: Fix UTF8 -> UTF16 conversion code

>+++ b/xpcom/string/public/nsUTF8Utils.h
>@@ -484,28 +486,45 @@ class CalculateUTF8Length

>+                // The below code therefor only adds 1 to mLength if the UTF8
>+                // data will produce a decoded character which is greater than
>+                // or equal to 0x010000 and less than 0x0110000.
>+                // It doesn't matter what to do the case where p + 4 > end
>+                // since no UTF16 characters will be written in that case by
>+                // ConvertUTF8toUTF16
>+
>+                if (p + 4 <= end) {
>+                  PRUint32 c = ((PRUint32)(p[0] & 0x07)) << 6 |
>+                               ((PRUint32)(p[1] & 0x30));
>+                  if (c >= 0x010 && c < 0x110)
>+                    ++mLength;

This bit-twiddling looks right to me, but it's obscure to say the least. Can you document it a bit more so that it's comprehensible without looking up the UTF-8 RFC?
Attachment #422868 - Flags: review?(smontagu) → review+
Comment on attachment 422868 [details] [diff] [review]
Part1: Fix UTF8 -> UTF16 conversion code

Plus some nits to the comments:

>@@ -484,28 +486,45 @@ class CalculateUTF8Length
>+                // The below code therefor only adds 1 to mLength if the UTF8
"The code below therefore"

>+                // data will produce a decoded character which is greater than
>+                // or equal to 0x010000 and less than 0x0110000.
>+                // It doesn't matter what to do the case where p + 4 > end
"to do in the case"
Comment on attachment 422893 [details] [diff] [review]
Part2: Fix UTF8 <-> UTF16 comparison code

>+++ b/xpcom/string/public/nsUTF8Utils.h

>+      else if ( ucs4 >= 0xD800 &&
>+                (ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF ||
>+                 ucs4 >= UCS_END))
>+        {
>+          // Surrogates and prohibited characters
>+          ucs4 = UCS2_REPLACEMENT_CHAR;
>+        }
>+

Firstly, shouldn't the condition be
 if ((ucs4 >= 0xD800 && ucs4 <= 0xDFFF) ||
      ucs4 == 0xFFFE || ucs4 == 0xFFFF || ucs4 >= UCS_END)
? (Even if you say that they are logically equivalent and your way is more optimized, I still think it's confusing)

Secondly, the prohibited characters should include --FFFE and --FFFF in all 17 planes and also 0xFDD0 - 0xFDEF. I suggest defining IS_NONCHARACTER in nsCharTraits.h to include all of those, adding a test for IS_NONCHARACTER to IS_VALID_CHAR, and then using ENSURE_VALID_CHAR in your code.
Attachment #422873 - Flags: review?(smontagu) → review+
(In reply to comment #12)
> (From update of attachment 422893 [details] [diff] [review])
> >+++ b/xpcom/string/public/nsUTF8Utils.h
> 
> >+      else if ( ucs4 >= 0xD800 &&
> >+                (ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF ||
> >+                 ucs4 >= UCS_END))
> >+        {
> >+          // Surrogates and prohibited characters
> >+          ucs4 = UCS2_REPLACEMENT_CHAR;
> >+        }
> >+
> 
> Firstly, shouldn't the condition be
>  if ((ucs4 >= 0xD800 && ucs4 <= 0xDFFF) ||
>       ucs4 == 0xFFFE || ucs4 == 0xFFFF || ucs4 >= UCS_END)
> ? (Even if you say that they are logically equivalent and your way is more
> optimized, I still think it's confusing)

Why is it confusing? There's no need to check if ucs4 == 0xFFFE if we know it's less than 0xD800. I do think performance is quite important here since this is a test we run for every single character during conversion. So just doing one test rather than four in the common case seems like a decent win.

> Secondly, the prohibited characters should include --FFFE and --FFFF in all 17
> planes and also 0xFDD0 - 0xFDEF. I suggest defining IS_NONCHARACTER in
> nsCharTraits.h to include all of those, adding a test for IS_NONCHARACTER to
> IS_VALID_CHAR, and then using ENSURE_VALID_CHAR in your code.

Hmm.. the old code didn't do this, but I guess we could add this if you are sure this is desired behavior.
So is 0xFDD0 - 0xFDEF also illegal in all 17 planes? Or just the first one?
(In reply to comment #13)
> Why is it confusing? There's no need to check if ucs4 == 0xFFFE if we know it's
> less than 0xD800. I do think performance is quite important here since this is
> a test we run for every single character during conversion. So just doing one
> test rather than four in the common case seems like a decent win.

OK, fair enough.

> Hmm.. the old code didn't do this, but I guess we could add this if you are
> sure this is desired behavior.

Yes, I think the old code is wrong and we should be checking for all noncharacters.

(In reply to comment #14)
> So is 0xFDD0 - 0xFDEF also illegal in all 17 planes? Or just the first one?

Only in plane 0. See section 16.7 in http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf
So I started coding up covering all illegal characters when converting between UTF8<->UTF16. It turns into a non trivial amount of work unfortunately since there's now a lot more cases where we need to check for illegal characters.

So not only do we need to check for more values every time we check for illegal characters. We also have to check in more places since for example a UTF16 surrogate pair can now result in an invalid character.

Is there a reason we're checking for these illegal characters at all while doing conversion? Currently the UTF16 -> UTF8 conversion code doesn't look for any illegal values other than broken surrogate pairs, and we don't seem to be hurting from this.
I think this is what it'd look like, but need to write tests to make sure it's correct. Let me know if this is something we really want to do.
Attachment #425617 - Flags: review?(smontagu)
Well in principle I think we should be doing either all or nothing, i.e. if we check for 0xFFFE and 0xFFFF we should check for the other non-characters as well. Also since we use these converters often to produce strings which we then pass to platform APIs, we really should be rigorous about "not exchanging non-characters between processes"
I agree about consistency, but I'd prefer to remove the 0xFFFE/0xFFFF checks.

Aren't we already in the situation where we're passing noncharacters to various APIs? From javascript we get UTF16, which I'd imagine we then either pass directly to platform APIs, or convert to UTF8 and pass to platform APIs. In neither case we currently check for noncharacters since the UTF16->UTF8 conversion code doesn't do any such checks.
I think this is a better way to deal with illegal characters in the conversion code. This makes us consistent between utf8->utf16 and utf16->utf8 conversion code in that neither converts illegal characters to replacement characters.

If we really want to ensure that we never pass illegal utf characters to various platform APIs, I think trying to enforce that in the conversion code is the wrong strategy.

Either we need to make sure that all of our strings are proper utf8/utf16, in which case we don't need any checks in the conversion code (though we could add assertions). Or we ensure that we check for illegal characters right before passing strings to platform APIs.

Checking in conversion code is IMHO the worst of both worlds since we neither can rely on a string being correct (it might have never been converted), but we're still paying the cost in the conversion code.
Attachment #428354 - Flags: review?(smontagu)
Comment on attachment 428354 [details] [diff] [review]
Part4: Make illegal char handling consistent

>diff --git a/xpcom/string/public/nsUTF8Utils.h b/xpcom/string/public/nsUTF8Utils.h
>--- a/xpcom/string/public/nsUTF8Utils.h
>+++ b/xpcom/string/public/nsUTF8Utils.h
>@@ -129,18 +129,17 @@ public:

>       else if ( ucs4 >= 0xD800 &&
>-                (ucs4 <= 0xDFFF || ucs4 == 0xFFFE || ucs4 == 0xFFFF ||
>-                 ucs4 >= UCS_END))
>+                (ucs4 <= 0xDFFF || ucs4 >= UCS_END))
>         {
>           // Surrogates and prohibited characters
>           ucs4 = UCS2_REPLACEMENT_CHAR;
>         }
> 

"Surrogates and code points outside the Unicode range"
Attachment #428354 - Flags: review?(smontagu) → review+
Attachment #422893 - Flags: review?(smontagu) → review+
(In reply to comment #20)

> If we really want to ensure that we never pass illegal utf characters to
> various platform APIs, I think trying to enforce that in the conversion code is
> the wrong strategy.

I have argued exactly this myself in the past, so I guess I'm convinced.

> 
> Either we need to make sure that all of our strings are proper utf8/utf16, in
> which case we don't need any checks in the conversion code (though we could add
> assertions).

Yes, let's do this (we already have bug 456391 on UTF8 strings, and we can extend it to cover UTF16 as well).
Is it possible that this broke Thunderbird Trunk? I now get this error:
/src/mailnews/base/util/nsMsgI18N.cpp:549: error: no matching function for call to ‘UTF8CharEnumerator::NextChar(const char**, const char*&)’
(In reply to comment #23)
> Is it possible that this broke Thunderbird Trunk? I now get this error:
> /src/mailnews/base/util/nsMsgI18N.cpp:549: error: no matching function for call
> to ‘UTF8CharEnumerator::NextChar(const char**, const char*&)’

Yes, I'm working on it.
Depends on: 549098
Flags: in-testsuite+
Target Milestone: --- → mozilla1.9.3a2
Depends on: 552793
You need to log in before you can comment on or make changes to this bug.