Non-ASCII (Plane 1 surrogates) link not correctly parsed in UTF-16

RESOLVED FIXED

Status

()

Core
Internationalization
RESOLVED FIXED
15 years ago
14 years ago

People

(Reporter: Niklas Dougherty, Assigned: Shanjian Li)

Tracking

(Depends on: 1 bug, {intl})

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(1 attachment, 1 obsolete attachment)

1014 bytes, patch
nhottanscp
: review+
Darin Fisher
: superreview+
Details | Diff | Splinter Review
(Reporter)

Description

15 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.2b) Gecko/20021031
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.2b) Gecko/20021031

Go to http://bugzilla.mozilla.org/attachment.cgi?id=106911&action=view and copy
the link displayed (it doesn't go to any existing file). Paste the copied link
into the URL field.

Result: http://bugzilla.mozilla.org/%D8.xhtml

Desired result:
http://bugzilla.mozilla.org/%F0%90%8D%85%F0%90%8C%BF%F0%90%8C%BB%F0%90%8D%86%F0%90%8C%B9%F0%90%8C%BB%F0%90%8C%B0.xhtml

The link consists of Gothic characters in Unicode Plane 1, so the parsing method
used for URI:s seemingly doesn't consider surrogates.


Reproducible: Always

Steps to Reproduce:
1.
2.
3.
(Reporter)

Comment 1

15 years ago
Somehow parsing does work, since you can view the source, copy the link there
and then paste it into the URL field. It just doesn't do it right when copying
from link (or clicking the link in normal fashion).
intl.  Happens on PC Linux too.
Assignee: asa → smontagu
Status: UNCONFIRMED → NEW
Component: Browser-General → Internationalization
Ever confirmed: true
OS: MacOS X → All
QA Contact: asa → ylong
Hardware: Macintosh → All
To Shanjian.

I notice some other problems:

When hovering over the link in the attachment, the URL is displayed incorrectly
in the status bar (as CJK characters)

When hovering over the URL-encoded version in comment 1, the URL is displayed in
a different incorrect form in the status bar.
Assignee: smontagu → shanjian

Comment 4

15 years ago
Seems there is a difference between branch and trunk build with the problem in
original report - when I right click to copy the link , I got:
http://bugzilla.mozilla.org/%FF%FD%FF%FD%FF%FD%FF%FD%FF%FD%FF%FD%FF%FD.xhtml
on WinXP-SimpChinese with 1.02 branch build.
while in 11-20 trunk build I have:
http://bugzilla.mozilla.org/%D8.xhtml
Keywords: intl

Updated

15 years ago
Blocks: 157673

Comment 5

15 years ago
Is this working on IE?

Updated

15 years ago
Depends on: 169388

Comment 6

15 years ago
On my WinXP-SC, IE doesn't display properly when load page: 
http://bugzilla.mozilla.org/attachment.cgi?id=106911&action=view

When click on the link to open a page in URL bar:
http://bugzilla.mozilla.org/<squares>.xhtml
but when I copy those string from IE URL bar to mozilla, then will display find
with those (hebrew looking?) characters (not escaped).
(Reporter)

Comment 7

15 years ago
Yes, those Gothic characters look a bit like Hebrew. Same kind of rounded style.

Bug 168382 handles garbage in status bar for UTF-16 pages... thought this was
fixed, but that fix also doesn't consider surrogates in the link. Must be the
same routines involved somehow.

Comment 8

15 years ago
I talked to Shanjian about how do we want to handle UTF-16 URI.
One way is just follow RFC 2396 and escape whatever necessary.
http://www.ietf.org/rfc/rfc2396.txt
Other way is to generate UTF-8 URI. I think that is good if UTF-16 is not really
used currently. UTF-8 URI is proposed in IRI draft.
http://www.ietf.org/internet-drafts/draft-duerst-iri-02.txt

cc to drain
(Reporter)

Comment 9

15 years ago
http://www.student.lu.se/~kin02ndo/gothic/ works as intended in UTF-8.
Summary: Non-ASCII (Plane 1 surrogates) link not correctly parsed → Non-ASCII (Plane 1 surrogates) link not correctly parsed in UTF-16
(Assignee)

Comment 10

15 years ago
If we want to handle utf-16 encoding as it is, many code need to be changed.
Some string assignment assume the string is null terminated, which does not work
for utf-16. Some part of the code do no conversion for ascii char. Some code
assuming scheme part (and probably other parts) can only be encoded in ascii.
All those issues need to be cleared if we go that approach. 
(Reporter)

Comment 11

15 years ago
UTF-16 URI:s work as they should for BMP (plane 0):
http://www.student.lu.se/~kin02ndo/%E4%B8%AD%E5%9B%BD%E8%AF%97/

It is just when you get surrogates that things go wrong. (With reservation for
what is causing the crash in bug 180263.)

Comment 12

15 years ago
UTF-8 can represent all of the UCS-4, right?  so, isn't it possible to always
perform a non-lossy conversion from UTF-16 to UTF-8?  if so, then why don't we
just do that for URLs.  i suppose we could run into trouble if code converts
UTF-8 to UCS-2... is it possible to UTF-16 everywhere that UCS-2 is currently
used?  are windows wide-APIs UCS-2 or UTF-16?

Comment 13

15 years ago
Let me clarify a couple of things.

1) Internally, we store URI as UTF-8 with the document charset (as
originCharset). Exception is if the URI is already escaped in the document then
that is stored as escaped.

2) When we send to the server, UTF-8 to originChrset conversion may be applied
depends on the protocol.

So, we are talking about #2 issue, whether we want to send UTF-8 or UTF-16 to
the server. Is that correct? Or is there dataloss internally for the UTF-16 case?


about darin's comment,
>UTF-8 can represent all of the UCS-4, right? 
that is right

>i suppose we could run into trouble if code converts
>UTF-8 to UCS-2
I think the conversion function we have (e.g. ConvertUTF8toUCS2) already takes
care UTF-16. Shanjian, is that correct?



(Assignee)

Comment 14

15 years ago
>> UTF-16 URI:s work as they should for BMP (plane 0):
>> http://www.student.lu.se/~kin02ndo/%E4%B8%AD%E5%9B%BD%E8%AF%97/
>> It is just when you get surrogates that things go wrong. (With
>> reservation for
>> what is causing the crash in bug 180263.)
Not really. First, the url you referenced is not really encoded in UTF-16, if so
thing like "http" should be encoded as "%00h%00t%00t%00p"...

Sencond, the string you mentioned does not contains the byte 0x00. It is 0x00
caused most of the problem, not surrogates.

>>I think the conversion function we have (e.g. ConvertUTF8toUCS2) >>already takes
>>care UTF-16. Shanjian, is that correct?
Our conversion function between utf8 and utf-16 works as expected. The problem
is in other places where we have some wrong assumption. (like ascii remains
ascii, not true for utf-16). 




Status: NEW → ASSIGNED

Comment 15

15 years ago
>The problem
>is in other places where we have some wrong assumption. (like ascii remains
>ascii, not true for utf-16). 
But as we store URI as UTF-8, we are not supposed to have UTF-16 URI
(AUTF8String is used in nsIURI.idl). Are you talking about the code before the
URI is stored in nsIURI?





(Assignee)

Comment 16

15 years ago
I just read rfc2396, it didn't mention encoding uri in utf-16. The rfc also has
the same assumption that ascii is a subset. In fact, I found this is a necessary
assumption. One just can't encode uri in utf-16, because the escape method won't
work for utf-16. '%', '0', '1' ... are all ascii. So I guess we have to use
utf-8 in this situation. 
(Assignee)

Comment 17

15 years ago
Created attachment 107085 [details] [diff] [review]
patch

Comment 18

15 years ago
Comment on attachment 107085 [details] [diff] [review]
patch

>+  if (originCharset.Equals(NS_LITERAL_STRING("UTF-16BE")) ||
>+      originCharset.Equals(NS_LITERAL_STRING("UTF-16LE")) )
>+    originCharset = NS_LITERAL_STRING("UTF-8");
>+
>   rv = nsHTMLUtils::IOService->NewURI(NS_ConvertUCS2toUTF8(aSpec),
>                                       NS_LossyConvertUCS2toASCII(originCharset).get(),

this looks like two buffer copies to me.  can you do something like this
instead:

   if (originCharset.Equals(NS_LITERAL_STRING("UTF-16BE")) ||
       originCharset.Equals(NS_LITERAL_STRING("UTF-16LE")) )
     originCharset.Truncate(); // empty charset implies UTF-8

   rv = nsHTMLUtils::IOService->NewURI(NS_ConvertUCS2toUTF8(aSpec),
				      
NS_LossyConvertUCS2toASCII(originCharset).get(),

an empty string implies UTF-8.
Attachment #107085 - Flags: review-

Comment 19

15 years ago
I think we want to add UTF-16, UTF-32, UTF32-BE, UTF-32LE which have the same issue.
(Assignee)

Comment 20

15 years ago
Created attachment 107159 [details] [diff] [review]
new patch

Since we know of all UTF encodings, only UTF-8 is NULL safe. I will speed the
comparison by only checking prefix utf-.
Attachment #107085 - Attachment is obsolete: true

Comment 21

15 years ago
Is the charset name canonicalized at that point and no need to check lower cases?
(Assignee)

Comment 22

15 years ago
The charset name should have been normalized till this stage. Whenever we get a
charset name from outside world, it should be normalized in the first place. If
not, that is a bug. For normal sources like meta, http header, I am sure that
has been done. 

Updated

15 years ago
Attachment #107159 - Flags: review+
(Assignee)

Comment 23

15 years ago
darin, could you sr?

Updated

15 years ago
Attachment #107159 - Flags: superreview+
(Assignee)

Comment 24

15 years ago
fix checked in.
Status: ASSIGNED → RESOLVED
Last Resolved: 15 years ago
Resolution: --- → FIXED
Out of curiousity, would it have made more sense to put this check inside
NS_NewURI so other callers could benefit too?

Comment 26

15 years ago
bz:
hmm... yeah.  that does sound like a good idea (nsIIOService::newURI you mean?).  

shanjian, nhotta:
should we just assume that UTF* as an origin charset always means UTF-8 across
the entire application?  we could make the change to nsStandardURL::Init to do
this correction.
(Assignee)

Comment 27

15 years ago
It should be safe to assume all utf* should be encoded by utf8. I will take a
look into this later.

Updated

15 years ago
No longer blocks: 157673

Updated

15 years ago
Depends on: 180372
You need to log in before you can comment on or make changes to this bug.