Closed Bug 1205933 Opened 9 years ago Closed 9 years ago

Fix UTF-8 StringConversion for decrypting or encrypting payload from FxSync in SyncEngine.

Categories

(Firefox OS Graveyard :: Sync, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
FxOS-S9 (16Oct)

People

(Reporter: selee, Assigned: selee)

References

Details

(Whiteboard: [partner-cherry-picked<2015/11/10>])

Attachments

(3 files)

Reproduce steps:
1. Browse Chinese Traditional or Japanese title page. (e.g. [1][2])
2. Sync records to FxSync.
3. Retrieve the records from FxSync to FxOS.
4. Go to browser in FxOS, and the history records that you just visited is incorrect encoding.

Besides, the encoding is correct if input the URL in FxOS browser manually.

[1] hinet.net
[2] konami.co.jp
Attached image CHT-hinet.net.png
Attached image JP-konami.co.jp.png
Screenshot:

attachment 8662744 [details] - Chinese chars - link: hinet.net

attachment 8662745 [details] - Japanese chars - link: konami.co.jp
Blocks: fxos-sync
In attachment 8662770 [details] [review], an ByteArray-To-UTF16-string decoder is implemented for fixing the incorrect encoding issue happens in Non-Ascii character set. We still need a encoder for FxSyncWebCrypto.encrypt usage.
Assignee: nobody → selee
Status: NEW → ASSIGNED
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Hey Michiel,

Could you give me a feedback for this PR?
Although it needs an encoder for converting UTF16 string to ByteArray.
Thank you!
Attachment #8662770 - Flags: feedback?(mbdejong)
The converter of UTF16 String to ByteArray is updated to PR attachment 8662770 [details] [review].
HiNet.net has content="text/html; charset=utf-8" so I'm surprised that Firefox Desktop would store its title as a UTF-16 string. We need to identify what the difference is between how FxDesktop does the encryption and how we do the decryption. If we find that difference, then we can adapt to that.
Target Milestone: --- → FxOS-S7 (18Sep)
JS String is in UTF16 (UCS-2) encoding [1], and Sync service stores the records in UTF-8 [2].

Browser will convert the string from the charset (defined by <head><meta charset="utf-8">) to UTF-16 in JS automatically, and it is based on the plain text case.

If there is an cryphertext encrypted in UTF-8, we still have to convert the payload from UTF-8 ByteArray to UTF-16 (JS String) after decrypting the cryphertext, and that's why we need to convert UTF-8 byte array to UTF-16 manually.

In addition, we have to do it reversely for encoding.

[1] http://stackoverflow.com/questions/11141136/default-javascript-character-encoding
[2] BSO, https://docs.services.mozilla.com/storage/apis-1.5.html#collections
Would it be correct to say that the page titles in the DataStore are UTF-16, but the page titles on FxSync are UTF-8?

In that case, I think it would be appropriate to keep the data in Kinto.js in line with the data on FxSync, and do this translation in the DataAdapter?
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Thanks for reporting this and preparing this patch!

As mentioned in the bug comments, before we can merge this, I think we need some more evidence that this is actually what we should be doing, and more detailed understanding of the charset discrepancy (e.g. reference to lines of Gecko code where FxDesktop does the same thing).

And depending on what the exact reason is that a conversion between UTF-8 and UTF-16 is necessary, it may not apply to all collections and to all record fields, so it may be more appropriate to do it in the DataAdapter.
Attachment #8662770 - Flags: feedback?(mbdejong) → feedback-
Target Milestone: FxOS-S7 (18Sep) → FxOS-S9 (16Oct)
Whiteboard: partner-cherry-pick
Whiteboard: partner-cherry-pick → [partner-cherry-pick]
Whiteboard: [partner-cherry-pick]
Whiteboard: [partner-cherry-pick]
Hi Natim,

As far as I know, payload data from FxSync is in UTF-8 encoding before it is encrypted.
Michiel and I would like to have a confirmation that my solution is correct, and we can discuss where to implement the UTF8 <-> UTF16 converter.
If the payload encoding is not documented, it should be declared explicitly IMO. 

May we have your confirmation of the encoding information for payload data?

Thank you!
Flags: needinfo?(rhubscher)
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Thank you, Natim! :)

Hi Michiel,

Based on comment 13 and comment 9, it is necessary to convert payload data between UTF-8 <-> UTF-16. 
Since the payload data is converted by FxSync service, I think that should not be handled by Kinto.js or Syncto.
Sync Engine is for FxSync Service currently, so I think it is worth to implement in SyncEngine codes.

Could you give
Attachment #8662770 - Flags: feedback- → feedback?(mbdejong)
Could you give me a feedback again? Thank you!   (enter key sends my incomplete comment out :(
I hope that something like that can help you: http://stackoverflow.com/a/14601808/186202
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Thanks for your information, Natim!

I use this way in attachment 8662770 [details] [review]:
http://stackoverflow.com/questions/17191945/conversion-between-utf-8-arraybuffer-and-string/22373135

Could you give a feedback from FxSync service's perspective?

Thank you!
Attachment #8662770 - Flags: feedback?(rhubscher)
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

(In reply to Sean Lee [:seanlee] from comment #14)
> the payload data is converted by FxSync service

Can you explain? IIUC, FxSync service is (and should be) all safely in UTF-8 land without any conversions going on anywhere within the FxSync service? If it's true that FxSync service converts charsets, then can you point to the Gecko or StorageServer code where this conversion happens?
Attachment #8662770 - Flags: feedback?(mbdejong) → feedback-
(In reply to Michiel de Jong [:michielbdejong] from comment #18)
> Comment on attachment 8662770 [details] [review]
> [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master
> 
> (In reply to Sean Lee [:seanlee] from comment #14)
> > the payload data is converted by FxSync service
> 
Probably it's not clear enough for my statement for this line. I mean the payload data is converted by FxSync service means converting to UTF-8.
> Can you explain? IIUC, FxSync service is (and should be) all safely in UTF-8
> land without any conversions going on anywhere within the FxSync service? If
> it's true that FxSync service converts charsets, then can you point to the
> Gecko or StorageServer code where this conversion happens?
In Javascript, UTF-16 (precisely UCS-2) is default encoding. The charset of REST command is UTF-8, so converting the result from UTF-8 to UTF-16 is done by Gecko.

There is one thing that I would like to discuss is why the content is correct for ascii characters.

The key point is this line:
https://github.com/mozilla-b2g/gaia/pull/31903/files#diff-46615b2daf574145e345e47bbd47cc04L213

One byte UTF-8 are in ASCII charset, so fromCharCode converts it correctly coincidentally.
In fromCharCode, we should consider each argument as a 21-bit number. 
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode

Here is an example to show the Chinese character '中文' in one-byte and two-byte format. 
http://jsbin.com/yofiwofesu/edit?js,console
Case 1 can explain why ASCII string can be converted correctly.

Case 2 is following fromCharCode's description to convert characters, and the characters shows correctly.

Case 3 is a wrong way to convert Chinese characters and is the case that this bug try to resolve.

Hey Michiel,

Could you give me a comment again? Thank you! :)
Flags: needinfo?(mbdejong)
(In reply to Sean Lee [:seanlee] from comment #19)
> Here is an example to show the Chinese character '中文' in one-byte and
> two-byte format. 
> http://jsbin.com/yofiwofesu/edit?js,console
> Case 1 can explain why ASCII string can be converted correctly.
> 
> Case 2 is following fromCharCode's description to convert characters, and
> the characters shows correctly.
> 
> Case 3 is a wrong way to convert Chinese characters and is the case that
> this bug try to resolve.

OK, so now I'm starting to think that we should treat all data as UTF-16 always and everywhere. :) I will look through the Gecko code, and if they use (an equivalent of) https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint instead of (an equivalent of) https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode then so I agree that so should we.
Flags: needinfo?(mbdejong)
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Thanks, Michiel!

Could you give a feedback again? :)
Attachment #8662770 - Flags: feedback- → feedback?(mbdejong)
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

After reading http://stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16 which states that some characters occupy 1 byte in UTF-8 and 2 in UTF-16, I am confused about why the 'can verify and decrypt a record' and 'can sign and encrypt a record' tests are passing both with and without this patch.

Note that both http://www.hinet.net/ and http://konami.co.jp/ have a <meta charset="UTF-8"> header, so the question is why they're displayed correctly at all in Desktop.

Can you add unit tests that pass with your patch, but fail in master?

I still feel we don't fully understand the situation here, and at least *if* interpreting data from FxSync as UTF-8 is incorrect, then that should be documented clearly on https://docs.services.mozilla.com/storage/apis-1.5.html#basic-storage-object and https://docs.services.mozilla.com/storage/apis-1.5.html#api-instructions.
Attachment #8662770 - Flags: feedback?(mbdejong) → feedback+
Mystery solved! (I think) https://github.com/mozilla-b2g/gaia/pull/31903/files#r41037092

We should:
* Rename this bug, the commit, and the PR, to something like "Fix UTF-8 StringConversion"
* Update the `historyRecordDec` fixture to contain at least one non-ASCII character
* Fix the test for `rawStringToByteArray` so that it tests that 'hi ✓' is converted to [104, 105, 32, 266, 156, 147], *not* (obviously) [104, 105, 32, 19].
* Instead of adding two new functions to stringconversion.js, just fix the existing `rawStringToByteArray` in StringConversion (maybe renaming it to utf8StringToByteArray?).
* Move the `const recordJSON = String.fromCharCode.apply(...` line from fxsyncwebcrypto.js to stringconversion.js, fix it so it supports UTF-8, call it either `byteArrayToRawString` or `byteArrayToUtf8String`, and add a unit test for that function.

Sorry for this bug! I can update your proposed patch with these changes on Monday, and then it should all be fixed.
I leave some comments at here:
https://github.com/mozilla-b2g/gaia/pull/31903/files#r41108080

I think this issue is still related to the converter of UTF-8 <-> UTF-16.

The history record in original fixtures is all characters in ASCII 1-byte character.
Please see what I mentioned:
https://bugzilla.mozilla.org/show_bug.cgi?id=1205933#c19

One byte UTF-8 are in ASCII charset, so fromCharCode converts it correctly coincidentally.
In fromCharCode, we should consider each argument as a 21-bit number. 
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode

So the fixtures of fxsyncwebcrypto is updated to add the character '€' for verifying the case in 3-byte UTF8 character.
Hi Michiel,

I follow your suggestions and update the PR.
'rawStringToByteArray' is renamed to assciiStringToByteArray because I think that's different with what stringToUtf8ByteArray does.
Could you help to see the patch again?

Thank you. :)
Flags: needinfo?(mbdejong)
Thanks! Left some comments on github about test coverage, function naming, and code comments. Apart from that, LGTM.

I think we could probably get rid of `stringToAsciiByteArray` altogether in favor of `stringToUtf8ByteArray`, but please double-check that all characters that can occur in the places where we use it result in an identical ByteArray when using `stringToUtf8ByteArray` as when using `stringToAsciiByteArray`. They characters to check are:

* A..Z
* a..z
* 0..9
* +
* /
* .
* =
Flags: needinfo?(mbdejong)
Ah, and please rename this bug, the commit, and the PR, to something like "Fix UTF-8 StringConversion", and remove all mentions of UTF-16 in the patch.
Summary: Incorrect Title Encoding for Chinese Traditional and Japanese characters in Browser History. → Fix UTF-8 StringConversion for decrypting or encrypting payload from FxSync in SyncEngine.
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Hi Fernando,

Could you help to review the PR? Thank you! :)
Attachment #8662770 - Flags: review?(ferjmoreno)
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

ASCII is a subset of UTF-8, so you should use stringToUTF8BytesArray  instead of stringToAsciiBytesArray, having two functions with the same functionality to maintain seems wrong to me.

Also if you pass UTF8 string to stringToAsciiBytesArray you may have unexpected/bogus behaviours.

nit: I think it should be BytesArray instead of ByteArray  (because it is an array of bytes)

Appart from that, it looks good to me.
Attachment #8662770 - Flags: feedback?(rhubscher) → feedback-
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Michiel knows that code better than me.
Attachment #8662770 - Flags: review?(ferjmoreno) → review?(mbdejong)
OK, so it seems that 'byte array' is a term from C/C++/C#, and in JavaScript it seems to refer to Int8Array, not Uint8Array, so maybe stringToUtf8Uint8Array is a more precise name than stringToUtf8ByteArray.

I agree with Natim that we should not have two functions that do the same, so please merge them into one.
Flags: needinfo?(selee)
Hi Michiel,

I did some function renaming and use stringToUtf8ByteArray for string conversion.
Could you help to review my PR again?
Thank you!
Flags: needinfo?(selee)
Hey Michiel, Natim,

If the string which contains the characters listed at comment 26 applies to stringToUtf8ByteArray or stringToAsciiByteArray, I suppose the both output will be the same.
Do we have to check the characters?
I am confused here. Could you give me a hint?

stringToAsciiByteArray (I renamed it to rawStringToUint8Array) is only used by base64StringToUint8Array.
base64StringToUint8Array has the input of a test case converting 'Af9=' to [0x01, 0xFF], and stringToUtf8ByteArray does not work here. How do you think if stringToAsciiByteArray is remained for base64StringToUint8Array usage?

Thank you!! :)
Flags: needinfo?(rhubscher)
Flags: needinfo?(mbdejong)
As far as I know, 'Af9=' is not an encoded string (nor ASCII, nor UTF-8). It is just a ByteArray encoded in Base64.
Flags: needinfo?(rhubscher)
What does you the string 'w6nDoA==' with both functions?
What give you the string 'w6nDoA==' with both functions?
If stringToAsciiByteArray gives you '\xc3\xa9\xc3\xa0' and stringToUtf8ByteArray gives you '\xe9\xe0' then we should probably keep both. The first one being raw decode bytes and the second one being UTF-8 characters.

In Python the difference looks like this one:

    >>> base64.b64decode('w6nDoA==')
    '\xc3\xa9\xc3\xa0'

    >>> base64.b64decode('w6nDoA==').decode('utf-8')
    u'\xe9\xe0'
When stringToAsciiByteArray is replaced with stringToUtf8ByteArray, the test is failed.
Here is the byte array of stringToUtf8ByteArray's result:
{ '0': 195,
  '1': 131,
  '2': 194,
  '3': 169,
  '4': 195,
  '5': 131,
  '6': 194,
  '7': 160 }
(In reply to Sean Lee [:seanlee] from comment #33)
> Hey Michiel, Natim,
> 
> If the string which contains the characters listed at comment 26 applies to
> stringToUtf8ByteArray or stringToAsciiByteArray, I suppose the both output
> will be the same.
> Do we have to check the characters?
> I am confused here. Could you give me a hint?
> 
> stringToAsciiByteArray (I renamed it to rawStringToUint8Array) is only used
> by base64StringToUint8Array.
> base64StringToUint8Array has the input of a test case converting 'Af9=' to
> [0x01, 0xFF], and stringToUtf8ByteArray does not work here. How do you think
> if stringToAsciiByteArray is remained for base64StringToUint8Array usage?
> 
> Thank you!! :)

I'm seeing [ 65, 102, 57, 61 ] for both stringToUtf8Uint8Array('Af9=') and rawStringToUint8Array('Af9=')
Flags: needinfo?(mbdejong)
Hey Michiel,

I use this diff with the attachment 8662770 [details] [review] together:
diff --git a/apps/sync/js/crypto/stringconversion.js b/apps/sync/js/crypto/stringconversion.js
index 3ab1448..27f5d6f 100644
--- a/apps/sync/js/crypto/stringconversion.js
+++ b/apps/sync/js/crypto/stringconversion.js
@@ -41,7 +41,7 @@ const StringConversion = Object.freeze((() => {
         throw new Error(`Number of base64 digits must be a multiple of 4 to con\
 vert to bytes`);
       }
-      return StringConversion.rawStringToUint8Array(window.atob(base64));
+      return StringConversion.stringToUtf8Uint8Array(window.atob(base64));
     },


and the test failed at here:
https://github.com/weilonge/gaia/blob/seanlee/DataSync/master/Bug1205933/apps/sync/test/unit/crypto/stringconversion_test.js#L48

The output of stringToUtf8Uint8Array is { '0': 1, '1': 195, '2': 191 }, and it is different with yours. Could you help to see the result?

Thank you!
Flags: needinfo?(mbdejong)
(In reply to Sean Lee [:seanlee] from comment #40)
> Hey Michiel,
> 
> I use this diff with the attachment 8662770 [details] [review] together:
> diff --git a/apps/sync/js/crypto/stringconversion.js
> b/apps/sync/js/crypto/stringconversion.js
> index 3ab1448..27f5d6f 100644
> --- a/apps/sync/js/crypto/stringconversion.js
> +++ b/apps/sync/js/crypto/stringconversion.js
> @@ -41,7 +41,7 @@ const StringConversion = Object.freeze((() => {
>          throw new Error(`Number of base64 digits must be a multiple of 4 to
> con\
>  vert to bytes`);
>        }
> -      return StringConversion.rawStringToUint8Array(window.atob(base64));
> +      return StringConversion.stringToUtf8Uint8Array(window.atob(base64));
>      },
> 
> 
> and the test failed at here:
> https://github.com/weilonge/gaia/blob/seanlee/DataSync/master/Bug1205933/
> apps/sync/test/unit/crypto/stringconversion_test.js#L48
> 
> The output of stringToUtf8Uint8Array is { '0': 1, '1': 195, '2': 191 }, and
> it is different with yours. Could you help to see the result?
> 
> Thank you!

Right, so we need to fix how we use atob and btoa then, looks like they are ASCII-only functions http://mdn.beonex.com/en/DOM/window.btoa.html#Unicode_Strings

Or if we only need rawStringToUint8Array inside our Base64 functions, then we can move them inside there.

The hex functions are fine, right?
Flags: needinfo?(mbdejong)
(In reply to Michiel de Jong [:michielbdejong] from comment #41)
> Right, so we need to fix how we use atob and btoa then, looks like they are
> ASCII-only functions
> http://mdn.beonex.com/en/DOM/window.btoa.html#Unicode_Strings

Sorry, that was not an accurate statement. Neither ASCII nor UTF-8 is involved here, the Base64 functions are used to work with raw binary data which has no meaning in any charset.

> Or if we only need rawStringToUint8Array inside our Base64 functions, then
> we can move them inside there.

This is not the case, the hex functions also need to work with raw data, so it's correct to keep the rawStringTo... and ...ToRawString functions in StringConversion.

I'll have another look at the patch.
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

You were right Sean, we still need the rawString-methods because the Base64- and Hex-methods use them work with binary data.

My only concern is about the test fixture, https://github.com/mozilla-b2g/gaia/pull/31903/files#r41390785 - otherwise r+!
Hey Michiel,

Thanks a lot to review my patch and provide these suggestions! The test is updated based on your comments.
Could you please change the review flag to r+?
Flags: needinfo?(mbdejong)
Comment on attachment 8662770 [details] [review]
[gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master

Awesome! Thanks a lot for your work on this Sean, and sorry for the amount of back-and-forth while I was learning about the whole charsets issue.
Flags: needinfo?(mbdejong)
Attachment #8662770 - Flags: review?(mbdejong) → review+
Attachment #8662770 - Flags: feedback- → feedback+
Thanks a lot for Natim and Michiel's suggestions :)

landed on master: https://github.com/mozilla-b2g/gaia/commit/22f18bf1dc01a565fb3cdd008b2c0c9a6ed4d5db
gaia test: https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=d29bb7761abe921c8dc25c2ff847d68758b6126b
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [partner-cherry-pick] → [partner-cherry-picked<2015/11/10>]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: