Closed
Bug 1205933
Opened 9 years ago
Closed 9 years ago
Fix UTF-8 StringConversion for decrypting or encrypting payload from FxSync in SyncEngine.
Categories
(Firefox OS Graveyard :: Sync, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
FxOS-S9 (16Oct)
People
(Reporter: selee, Assigned: selee)
References
Details
(Whiteboard: [partner-cherry-picked<2015/11/10>])
Attachments
(3 files)
Reproduce steps: 1. Browse Chinese Traditional or Japanese title page. (e.g. [1][2]) 2. Sync records to FxSync. 3. Retrieve the records from FxSync to FxOS. 4. Go to browser in FxOS, and the history records that you just visited is incorrect encoding. Besides, the encoding is correct if input the URL in FxOS browser manually. [1] hinet.net [2] konami.co.jp
Assignee | ||
Comment 1•9 years ago
|
||
Assignee | ||
Comment 2•9 years ago
|
||
Assignee | ||
Comment 3•9 years ago
|
||
Screenshot: attachment 8662744 [details] - Chinese chars - link: hinet.net attachment 8662745 [details] - Japanese chars - link: konami.co.jp
Blocks: fxos-sync
Comment 4•9 years ago
|
||
Assignee | ||
Comment 5•9 years ago
|
||
In attachment 8662770 [details] [review], an ByteArray-To-UTF16-string decoder is implemented for fixing the incorrect encoding issue happens in Non-Ascii character set. We still need a encoder for FxSyncWebCrypto.encrypt usage.
Assignee: nobody → selee
Status: NEW → ASSIGNED
Assignee | ||
Comment 6•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Hey Michiel, Could you give me a feedback for this PR? Although it needs an encoder for converting UTF16 string to ByteArray. Thank you!
Attachment #8662770 -
Flags: feedback?(mbdejong)
Assignee | ||
Comment 7•9 years ago
|
||
The converter of UTF16 String to ByteArray is updated to PR attachment 8662770 [details] [review].
Comment 8•9 years ago
|
||
HiNet.net has content="text/html; charset=utf-8" so I'm surprised that Firefox Desktop would store its title as a UTF-16 string. We need to identify what the difference is between how FxDesktop does the encryption and how we do the decryption. If we find that difference, then we can adapt to that.
Updated•9 years ago
|
Target Milestone: --- → FxOS-S7 (18Sep)
Assignee | ||
Comment 9•9 years ago
|
||
JS String is in UTF16 (UCS-2) encoding [1], and Sync service stores the records in UTF-8 [2]. Browser will convert the string from the charset (defined by <head><meta charset="utf-8">) to UTF-16 in JS automatically, and it is based on the plain text case. If there is an cryphertext encrypted in UTF-8, we still have to convert the payload from UTF-8 ByteArray to UTF-16 (JS String) after decrypting the cryphertext, and that's why we need to convert UTF-8 byte array to UTF-16 manually. In addition, we have to do it reversely for encoding. [1] http://stackoverflow.com/questions/11141136/default-javascript-character-encoding [2] BSO, https://docs.services.mozilla.com/storage/apis-1.5.html#collections
Comment 10•9 years ago
|
||
Would it be correct to say that the page titles in the DataStore are UTF-16, but the page titles on FxSync are UTF-8? In that case, I think it would be appropriate to keep the data in Kinto.js in line with the data on FxSync, and do this translation in the DataAdapter?
Comment 11•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Thanks for reporting this and preparing this patch! As mentioned in the bug comments, before we can merge this, I think we need some more evidence that this is actually what we should be doing, and more detailed understanding of the charset discrepancy (e.g. reference to lines of Gecko code where FxDesktop does the same thing). And depending on what the exact reason is that a conversion between UTF-8 and UTF-16 is necessary, it may not apply to all collections and to all record fields, so it may be more appropriate to do it in the DataAdapter.
Attachment #8662770 -
Flags: feedback?(mbdejong) → feedback-
Updated•9 years ago
|
Target Milestone: FxOS-S7 (18Sep) → FxOS-S9 (16Oct)
Assignee | ||
Updated•9 years ago
|
Whiteboard: partner-cherry-pick
Assignee | ||
Updated•9 years ago
|
Whiteboard: partner-cherry-pick → [partner-cherry-pick]
Assignee | ||
Updated•9 years ago
|
Whiteboard: [partner-cherry-pick]
Assignee | ||
Updated•9 years ago
|
Whiteboard: [partner-cherry-pick]
Assignee | ||
Comment 12•9 years ago
|
||
Hi Natim, As far as I know, payload data from FxSync is in UTF-8 encoding before it is encrypted. Michiel and I would like to have a confirmation that my solution is correct, and we can discuss where to implement the UTF8 <-> UTF16 converter. If the payload encoding is not documented, it should be declared explicitly IMO. May we have your confirmation of the encoding information for payload data? Thank you!
Flags: needinfo?(rhubscher)
Comment 13•9 years ago
|
||
Yes everything should be in UTF-8. BSO's: https://docs.services.mozilla.com/storage/apis-1.5.html#basic-storage-object HTTP requests: https://docs.services.mozilla.com/storage/apis-1.5.html#api-instructions
Flags: needinfo?(rhubscher)
Assignee | ||
Comment 14•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Thank you, Natim! :) Hi Michiel, Based on comment 13 and comment 9, it is necessary to convert payload data between UTF-8 <-> UTF-16. Since the payload data is converted by FxSync service, I think that should not be handled by Kinto.js or Syncto. Sync Engine is for FxSync Service currently, so I think it is worth to implement in SyncEngine codes. Could you give
Attachment #8662770 -
Flags: feedback- → feedback?(mbdejong)
Assignee | ||
Comment 15•9 years ago
|
||
Could you give me a feedback again? Thank you! (enter key sends my incomplete comment out :(
Comment 16•9 years ago
|
||
I hope that something like that can help you: http://stackoverflow.com/a/14601808/186202
Assignee | ||
Comment 17•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Thanks for your information, Natim! I use this way in attachment 8662770 [details] [review]: http://stackoverflow.com/questions/17191945/conversion-between-utf-8-arraybuffer-and-string/22373135 Could you give a feedback from FxSync service's perspective? Thank you!
Attachment #8662770 -
Flags: feedback?(rhubscher)
Comment 18•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master (In reply to Sean Lee [:seanlee] from comment #14) > the payload data is converted by FxSync service Can you explain? IIUC, FxSync service is (and should be) all safely in UTF-8 land without any conversions going on anywhere within the FxSync service? If it's true that FxSync service converts charsets, then can you point to the Gecko or StorageServer code where this conversion happens?
Attachment #8662770 -
Flags: feedback?(mbdejong) → feedback-
Assignee | ||
Comment 19•9 years ago
|
||
(In reply to Michiel de Jong [:michielbdejong] from comment #18) > Comment on attachment 8662770 [details] [review] > [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master > > (In reply to Sean Lee [:seanlee] from comment #14) > > the payload data is converted by FxSync service > Probably it's not clear enough for my statement for this line. I mean the payload data is converted by FxSync service means converting to UTF-8. > Can you explain? IIUC, FxSync service is (and should be) all safely in UTF-8 > land without any conversions going on anywhere within the FxSync service? If > it's true that FxSync service converts charsets, then can you point to the > Gecko or StorageServer code where this conversion happens? In Javascript, UTF-16 (precisely UCS-2) is default encoding. The charset of REST command is UTF-8, so converting the result from UTF-8 to UTF-16 is done by Gecko. There is one thing that I would like to discuss is why the content is correct for ascii characters. The key point is this line: https://github.com/mozilla-b2g/gaia/pull/31903/files#diff-46615b2daf574145e345e47bbd47cc04L213 One byte UTF-8 are in ASCII charset, so fromCharCode converts it correctly coincidentally. In fromCharCode, we should consider each argument as a 21-bit number. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode Here is an example to show the Chinese character '中文' in one-byte and two-byte format. http://jsbin.com/yofiwofesu/edit?js,console Case 1 can explain why ASCII string can be converted correctly. Case 2 is following fromCharCode's description to convert characters, and the characters shows correctly. Case 3 is a wrong way to convert Chinese characters and is the case that this bug try to resolve. Hey Michiel, Could you give me a comment again? Thank you! :)
Flags: needinfo?(mbdejong)
Comment 20•9 years ago
|
||
(In reply to Sean Lee [:seanlee] from comment #19) > Here is an example to show the Chinese character '中文' in one-byte and > two-byte format. > http://jsbin.com/yofiwofesu/edit?js,console > Case 1 can explain why ASCII string can be converted correctly. > > Case 2 is following fromCharCode's description to convert characters, and > the characters shows correctly. > > Case 3 is a wrong way to convert Chinese characters and is the case that > this bug try to resolve. OK, so now I'm starting to think that we should treat all data as UTF-16 always and everywhere. :) I will look through the Gecko code, and if they use (an equivalent of) https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint instead of (an equivalent of) https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode then so I agree that so should we.
Flags: needinfo?(mbdejong)
Assignee | ||
Comment 21•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Thanks, Michiel! Could you give a feedback again? :)
Attachment #8662770 -
Flags: feedback- → feedback?(mbdejong)
Comment 22•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master After reading http://stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16 which states that some characters occupy 1 byte in UTF-8 and 2 in UTF-16, I am confused about why the 'can verify and decrypt a record' and 'can sign and encrypt a record' tests are passing both with and without this patch. Note that both http://www.hinet.net/ and http://konami.co.jp/ have a <meta charset="UTF-8"> header, so the question is why they're displayed correctly at all in Desktop. Can you add unit tests that pass with your patch, but fail in master? I still feel we don't fully understand the situation here, and at least *if* interpreting data from FxSync as UTF-8 is incorrect, then that should be documented clearly on https://docs.services.mozilla.com/storage/apis-1.5.html#basic-storage-object and https://docs.services.mozilla.com/storage/apis-1.5.html#api-instructions.
Attachment #8662770 -
Flags: feedback?(mbdejong) → feedback+
Comment 23•9 years ago
|
||
Mystery solved! (I think) https://github.com/mozilla-b2g/gaia/pull/31903/files#r41037092 We should: * Rename this bug, the commit, and the PR, to something like "Fix UTF-8 StringConversion" * Update the `historyRecordDec` fixture to contain at least one non-ASCII character * Fix the test for `rawStringToByteArray` so that it tests that 'hi ✓' is converted to [104, 105, 32, 266, 156, 147], *not* (obviously) [104, 105, 32, 19]. * Instead of adding two new functions to stringconversion.js, just fix the existing `rawStringToByteArray` in StringConversion (maybe renaming it to utf8StringToByteArray?). * Move the `const recordJSON = String.fromCharCode.apply(...` line from fxsyncwebcrypto.js to stringconversion.js, fix it so it supports UTF-8, call it either `byteArrayToRawString` or `byteArrayToUtf8String`, and add a unit test for that function. Sorry for this bug! I can update your proposed patch with these changes on Monday, and then it should all be fixed.
Assignee | ||
Comment 24•9 years ago
|
||
I leave some comments at here: https://github.com/mozilla-b2g/gaia/pull/31903/files#r41108080 I think this issue is still related to the converter of UTF-8 <-> UTF-16. The history record in original fixtures is all characters in ASCII 1-byte character. Please see what I mentioned: https://bugzilla.mozilla.org/show_bug.cgi?id=1205933#c19 One byte UTF-8 are in ASCII charset, so fromCharCode converts it correctly coincidentally. In fromCharCode, we should consider each argument as a 21-bit number. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode So the fixtures of fxsyncwebcrypto is updated to add the character '€' for verifying the case in 3-byte UTF8 character.
Assignee | ||
Comment 25•9 years ago
|
||
Hi Michiel, I follow your suggestions and update the PR. 'rawStringToByteArray' is renamed to assciiStringToByteArray because I think that's different with what stringToUtf8ByteArray does. Could you help to see the patch again? Thank you. :)
Flags: needinfo?(mbdejong)
Comment 26•9 years ago
|
||
Thanks! Left some comments on github about test coverage, function naming, and code comments. Apart from that, LGTM. I think we could probably get rid of `stringToAsciiByteArray` altogether in favor of `stringToUtf8ByteArray`, but please double-check that all characters that can occur in the places where we use it result in an identical ByteArray when using `stringToUtf8ByteArray` as when using `stringToAsciiByteArray`. They characters to check are: * A..Z * a..z * 0..9 * + * / * . * =
Flags: needinfo?(mbdejong)
Comment 27•9 years ago
|
||
Ah, and please rename this bug, the commit, and the PR, to something like "Fix UTF-8 StringConversion", and remove all mentions of UTF-16 in the patch.
Assignee | ||
Updated•9 years ago
|
Summary: Incorrect Title Encoding for Chinese Traditional and Japanese characters in Browser History. → Fix UTF-8 StringConversion for decrypting or encrypting payload from FxSync in SyncEngine.
Assignee | ||
Comment 28•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Hi Fernando, Could you help to review the PR? Thank you! :)
Attachment #8662770 -
Flags: review?(ferjmoreno)
Comment 29•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master ASCII is a subset of UTF-8, so you should use stringToUTF8BytesArray instead of stringToAsciiBytesArray, having two functions with the same functionality to maintain seems wrong to me. Also if you pass UTF8 string to stringToAsciiBytesArray you may have unexpected/bogus behaviours. nit: I think it should be BytesArray instead of ByteArray (because it is an array of bytes) Appart from that, it looks good to me.
Attachment #8662770 -
Flags: feedback?(rhubscher) → feedback-
Comment 30•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Michiel knows that code better than me.
Attachment #8662770 -
Flags: review?(ferjmoreno) → review?(mbdejong)
Comment 31•9 years ago
|
||
OK, so it seems that 'byte array' is a term from C/C++/C#, and in JavaScript it seems to refer to Int8Array, not Uint8Array, so maybe stringToUtf8Uint8Array is a more precise name than stringToUtf8ByteArray. I agree with Natim that we should not have two functions that do the same, so please merge them into one.
Flags: needinfo?(selee)
Assignee | ||
Comment 32•9 years ago
|
||
Hi Michiel, I did some function renaming and use stringToUtf8ByteArray for string conversion. Could you help to review my PR again? Thank you!
Flags: needinfo?(selee)
Assignee | ||
Comment 33•9 years ago
|
||
Hey Michiel, Natim, If the string which contains the characters listed at comment 26 applies to stringToUtf8ByteArray or stringToAsciiByteArray, I suppose the both output will be the same. Do we have to check the characters? I am confused here. Could you give me a hint? stringToAsciiByteArray (I renamed it to rawStringToUint8Array) is only used by base64StringToUint8Array. base64StringToUint8Array has the input of a test case converting 'Af9=' to [0x01, 0xFF], and stringToUtf8ByteArray does not work here. How do you think if stringToAsciiByteArray is remained for base64StringToUint8Array usage? Thank you!! :)
Flags: needinfo?(rhubscher)
Flags: needinfo?(mbdejong)
Comment 34•9 years ago
|
||
As far as I know, 'Af9=' is not an encoded string (nor ASCII, nor UTF-8). It is just a ByteArray encoded in Base64.
Flags: needinfo?(rhubscher)
Comment 35•9 years ago
|
||
What does you the string 'w6nDoA==' with both functions?
Comment 36•9 years ago
|
||
What give you the string 'w6nDoA==' with both functions?
Comment 37•9 years ago
|
||
If stringToAsciiByteArray gives you '\xc3\xa9\xc3\xa0' and stringToUtf8ByteArray gives you '\xe9\xe0' then we should probably keep both. The first one being raw decode bytes and the second one being UTF-8 characters. In Python the difference looks like this one: >>> base64.b64decode('w6nDoA==') '\xc3\xa9\xc3\xa0' >>> base64.b64decode('w6nDoA==').decode('utf-8') u'\xe9\xe0'
Assignee | ||
Comment 38•9 years ago
|
||
When stringToAsciiByteArray is replaced with stringToUtf8ByteArray, the test is failed. Here is the byte array of stringToUtf8ByteArray's result: { '0': 195, '1': 131, '2': 194, '3': 169, '4': 195, '5': 131, '6': 194, '7': 160 }
Comment 39•9 years ago
|
||
(In reply to Sean Lee [:seanlee] from comment #33) > Hey Michiel, Natim, > > If the string which contains the characters listed at comment 26 applies to > stringToUtf8ByteArray or stringToAsciiByteArray, I suppose the both output > will be the same. > Do we have to check the characters? > I am confused here. Could you give me a hint? > > stringToAsciiByteArray (I renamed it to rawStringToUint8Array) is only used > by base64StringToUint8Array. > base64StringToUint8Array has the input of a test case converting 'Af9=' to > [0x01, 0xFF], and stringToUtf8ByteArray does not work here. How do you think > if stringToAsciiByteArray is remained for base64StringToUint8Array usage? > > Thank you!! :) I'm seeing [ 65, 102, 57, 61 ] for both stringToUtf8Uint8Array('Af9=') and rawStringToUint8Array('Af9=')
Flags: needinfo?(mbdejong)
Assignee | ||
Comment 40•9 years ago
|
||
Hey Michiel, I use this diff with the attachment 8662770 [details] [review] together: diff --git a/apps/sync/js/crypto/stringconversion.js b/apps/sync/js/crypto/stringconversion.js index 3ab1448..27f5d6f 100644 --- a/apps/sync/js/crypto/stringconversion.js +++ b/apps/sync/js/crypto/stringconversion.js @@ -41,7 +41,7 @@ const StringConversion = Object.freeze((() => { throw new Error(`Number of base64 digits must be a multiple of 4 to con\ vert to bytes`); } - return StringConversion.rawStringToUint8Array(window.atob(base64)); + return StringConversion.stringToUtf8Uint8Array(window.atob(base64)); }, and the test failed at here: https://github.com/weilonge/gaia/blob/seanlee/DataSync/master/Bug1205933/apps/sync/test/unit/crypto/stringconversion_test.js#L48 The output of stringToUtf8Uint8Array is { '0': 1, '1': 195, '2': 191 }, and it is different with yours. Could you help to see the result? Thank you!
Flags: needinfo?(mbdejong)
Comment 41•9 years ago
|
||
(In reply to Sean Lee [:seanlee] from comment #40) > Hey Michiel, > > I use this diff with the attachment 8662770 [details] [review] together: > diff --git a/apps/sync/js/crypto/stringconversion.js > b/apps/sync/js/crypto/stringconversion.js > index 3ab1448..27f5d6f 100644 > --- a/apps/sync/js/crypto/stringconversion.js > +++ b/apps/sync/js/crypto/stringconversion.js > @@ -41,7 +41,7 @@ const StringConversion = Object.freeze((() => { > throw new Error(`Number of base64 digits must be a multiple of 4 to > con\ > vert to bytes`); > } > - return StringConversion.rawStringToUint8Array(window.atob(base64)); > + return StringConversion.stringToUtf8Uint8Array(window.atob(base64)); > }, > > > and the test failed at here: > https://github.com/weilonge/gaia/blob/seanlee/DataSync/master/Bug1205933/ > apps/sync/test/unit/crypto/stringconversion_test.js#L48 > > The output of stringToUtf8Uint8Array is { '0': 1, '1': 195, '2': 191 }, and > it is different with yours. Could you help to see the result? > > Thank you! Right, so we need to fix how we use atob and btoa then, looks like they are ASCII-only functions http://mdn.beonex.com/en/DOM/window.btoa.html#Unicode_Strings Or if we only need rawStringToUint8Array inside our Base64 functions, then we can move them inside there. The hex functions are fine, right?
Flags: needinfo?(mbdejong)
Comment 42•9 years ago
|
||
(In reply to Michiel de Jong [:michielbdejong] from comment #41) > Right, so we need to fix how we use atob and btoa then, looks like they are > ASCII-only functions > http://mdn.beonex.com/en/DOM/window.btoa.html#Unicode_Strings Sorry, that was not an accurate statement. Neither ASCII nor UTF-8 is involved here, the Base64 functions are used to work with raw binary data which has no meaning in any charset. > Or if we only need rawStringToUint8Array inside our Base64 functions, then > we can move them inside there. This is not the case, the hex functions also need to work with raw data, so it's correct to keep the rawStringTo... and ...ToRawString functions in StringConversion. I'll have another look at the patch.
Comment 43•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master You were right Sean, we still need the rawString-methods because the Base64- and Hex-methods use them work with binary data. My only concern is about the test fixture, https://github.com/mozilla-b2g/gaia/pull/31903/files#r41390785 - otherwise r+!
Assignee | ||
Comment 44•9 years ago
|
||
Hey Michiel, Thanks a lot to review my patch and provide these suggestions! The test is updated based on your comments. Could you please change the review flag to r+?
Flags: needinfo?(mbdejong)
Comment 45•9 years ago
|
||
Comment on attachment 8662770 [details] [review] [gaia] weilonge:seanlee/DataSync/master/Bug1205933 > mozilla-b2g:master Awesome! Thanks a lot for your work on this Sean, and sorry for the amount of back-and-forth while I was learning about the whole charsets issue.
Flags: needinfo?(mbdejong)
Attachment #8662770 -
Flags: review?(mbdejong) → review+
Updated•9 years ago
|
Attachment #8662770 -
Flags: feedback- → feedback+
Assignee | ||
Comment 46•9 years ago
|
||
Thanks a lot for Natim and Michiel's suggestions :) landed on master: https://github.com/mozilla-b2g/gaia/commit/22f18bf1dc01a565fb3cdd008b2c0c9a6ed4d5db gaia test: https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=d29bb7761abe921c8dc25c2ff847d68758b6126b
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
|
Whiteboard: [partner-cherry-pick] → [partner-cherry-picked<2015/11/10>]
You need to log in
before you can comment on or make changes to this bug.
Description
•