Closed Bug 74137 Opened 23 years ago Closed 23 years ago

Composer: anchor doesn't accept non-ascii name

Categories

(Core :: Internationalization, defect, P3)


Tracking


VERIFIED FIXED
mozilla0.9.2

People

(Reporter: ji, Assigned: cmanske)

References

(Depends on 1 open bug)

Details

(Keywords: intl, regression)

Attachments

(6 files)

Please mark this bug as a dup if there is already one for this problem.

The anchor can't accept non-ASCII data.
Steps to reproduce:
1. Launch Composer.
2. Enter a string containing non-ASCII characters in the composer, like ànchor.
3. Highlight the string and click on the Anchor button.
The anchor name strips out the Latin-1 character; it shows up as "nchor".
If the string is Japanese, no anchor name is captured at all.
4. In the anchor name field, type non-ASCII data. The non-ASCII chars don't show
after hitting Enter to commit.
Is this a new problem, or the same as in 6.0 and 6.01?
Keywords: intl
6.01 doesn't have this problem.
This is a regression.
Keywords: regression
I think Charley changed this recently.
OS: Windows 95 → All
I didn't see this when I checked it on the 03-05 trunk build, so it must be some change
after that.
Reassigning to cmanske; please check if this is related to your change.
Assignee: nhotta → cmanske
The current implementation allows only "CDATA" strings, according to the HTML
4.0 spec. I don't think it's supposed to allow foreign characters, is it?
Can a URL have Japanese characters? If not, then the current implementation
is correct, since the anchor name becomes part of a URL when referred to in a link.
"CDATA is a sequence of characters from the document character set and may
include character entities."
http://www.w3.org/TR/html4/types.html#type-cdata

So non-ASCII is allowed. But another place in the spec recommends that UTF-8 be used
for HREF. I implemented this in Mozilla after 6.0, so this is not supported in
6.01 and earlier.
http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars

I think not allowing non-ASCII is fine. We may want to have an option to allow
non-ASCII and use UTF-8 for that case. Cc to momoi; he is responsible for HTML
compatibility issues.

So in the case where the user selects only non-ASCII characters, should the anchor
edit field come up empty?
What if the user types non-ASCII in the field? We need to alert the user and
explain that non-ASCII cannot be used there.



QA Contact: andreasb → ylong
setting to mozilla0.9.1
Priority: -- → P3
Target Milestone: --- → mozilla0.9.1
> I think not allowing non ASCII is fine. We may want to 
> have an option to allow non ASCII and use UTF-8 for that 
> case. Cc to momoi, he is responsible for HTML 
> compatibility issues.

If we are not allowing non-ASCII characters in
the 'name' attribute, how can a user input non-ASCII
domain names? This will probably become a necessity
not so long from now.
We can debate whether or not there should be
UI for the non-ASCII option in the anchor value,
but there is a bug here. Composer should be able to
handle non-ASCII anchor values in any case. This way,
when IDN gets implemented, Composer will at least
not mangle non-ASCII characters.
We can discourage users from creating "#non-ASCII_string"
type anchor names internal to the document, but I don't
think we should withhold the basic means for it.
I recommend fixing the non-ASCII handling bug regardless.
This is for anchor names, so not really related to IDN.

So anchors can be non-ASCII, as I mentioned. Whether we allow it or not in the
editor can be an option. When non-ASCII is allowed, HREF should take care of it.

a) Follow the HTML 4 spec to generate UTF-8, percent-encoded. I think older
browsers (6.01 and earlier) do not recognize this.

b) Use raw 8-bit in HREF, which is not recommended in the HTML spec.

c) Use a percent-encoded name in the document charset.

I don't know whether 4.x generates anchors in b) or c) style.
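As a rough sketch of the three styles (assuming a Latin-1 document and the name "à"; JavaScript's `encodeURIComponent` always emits UTF-8, so style (c) is shown as a hard-coded value rather than computed):

```javascript
const name = '\u00e0';  // "à"

// Style (a): UTF-8, percent-encoded (the HTML 4 recommendation).
const styleA = encodeURIComponent(name);   // "%C3%A0" -- UTF-8 bytes C3 A0

// Style (b): raw 8-bit character in the HREF, no escaping (not recommended).
const styleB = name;                       // the single byte E0 in a Latin-1 document

// Style (c): percent-encoded in the *document* charset. For Latin-1, "à"
// is the byte E0, so the escaped form is "%E0". (Hard-coded here: JS has
// no built-in encoder for legacy charsets.)
const styleC = '%E0';

console.log(styleA, styleB, styleC);
```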

What I meant to propose in my last comment was to have an option to
allow/disallow non-ASCII anchors, then use a), which is UTF-8. So if the user
wants to support older browsers, the user has to avoid non-ASCII anchors. I prefer
not allowing non-ASCII as a default, to avoid complication.

When editing an existing document, non-ASCII characters in an anchor should not be
removed unless the user edits the anchor in the Mozilla editor.

Thanks for the reply.

> I don't know whether 4.x generates anchors in b) or c) style.

Comm 4.x uses b).

> When editing an existing document, non ASCII characters in 
> anchor should not be removed unless the user edit the 
> anchor in mozilla editor.

With current builds, editing of an existing non-ASCII anchor 
is not possible due to this bug. 

> What I meant to propose in my last comment was to have an
> option to allow/disallow non-ASCII anchors, then use a),
> which is UTF-8. So if the user wants to support older
> browsers, the user has to avoid non-ASCII anchors. I prefer
> not allowing non-ASCII as a default, to avoid complication.

Can this process be automatic, so that the user can
load and edit an existing non-ASCII anchor without
setting the pref? You can disallow creation of new
non-ASCII anchor names.
So can someone help me here? How do I "support non-ASCII characters"?
* We can have a bool pref for the non-ASCII character option in anchors. If non-ASCII
is allowed, then the editor should not exclude non-ASCII characters when editing
and generating anchors.
* For the URL in HREF: if the pref allows non-ASCII anchors, then convert the string
to UTF-8, otherwise convert to the document charset; apply URL escaping regardless of
the pref value.


Answering Momoi-san's question: existing non-ASCII anchors will be preserved
regardless of the pref value (in UTF-8 or the document charset).
Please, no more options/preferences. We should handle all characters accepted in
the charset.

Charley--sounds like you'll need to convert the string to the document charset
and apply URL escaping (I think the latter is described in the HTML4 spec;
%-escaping?).
If the editor has a pref to control standard vs. compatibility mode, then that
could be used instead. I don't feel good about generating a format that is not
recommended by the spec.
Here is an example document that uses non-ASCII anchor names (from bug 58819).
http://www.fcenter.ru/wn/hn02082000.htm
This document does not escape in either "<A NAME=" or "<A HREF="; it uses raw
8-bit characters in the document charset (windows-1251).

There is another example, also from bug 58819.
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=19353
It has HREF parts that use the document charset, URL-escaped:
"ja_anchors.html#%C6%FC%CB%DC%B8%EC"
Also an HREF with UTF-8, URL-escaped (as mentioned in the HTML spec):
"ja_anchors.html#%E6%BC%A2%E5%AD%97"

All of the above cases are supported on the trunk but not in 6.0/6.01 (and that was
bug 58819).

I cannot find an example that escapes the "<A NAME=" part. I think that's because
it doesn't have to be escaped (you can just use 8-bit in the document charset).
Let me try to create one later and see if that's supported by our browsers.
I tested a document with escaped anchors (e.g. <A NAME="%E0">).
Because an anchor name is not a URL, it is not supposed to be URL-escaped. If it's
escaped, then it is treated as an individual entity (e.g. the names "%E0" and "à"
should be recognized as different entities).
So what I told Charley this morning was incorrect. Anchor names should not be
unescaped. I checked both 4.x and 6.0; they do not unescape anchors.
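A minimal sketch of the distinction (using the legacy `unescape`, which decodes single-byte %XX escapes as Latin-1): a name that merely looks percent-encoded must stay a distinct entity from the character it would decode to, so anchor lookup has to compare names literally.

```javascript
// Two anchors whose names must remain distinct entities:
const anchors = ['%E0', '\u00e0'];   // the literal string "%E0", and "à"

// Literal comparison: a fragment "#%E0" matches only the first anchor.
const match = fragment => anchors.filter(n => n === fragment);
console.log(match('%E0'));           // [ '%E0' ]

// If names were unescaped first, both would collapse to "à" and the two
// entities could no longer be told apart:
console.log(anchors.map(n => unescape(n)));  // [ 'à', 'à' ]
```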
Naoki understands this much better than me.
Assignee: cmanske → nhotta
adding nsbeta1 keyword.
Keywords: nsbeta1
Before allowing non-ASCII input, we should URL-escape as Kathy mentioned.
Which file should I look at to escape the URL in HREF?
Marina, thanks for the help!
QA Contact: ylong → marina
re-assigning QA contact
QA Contact: marina → ylong
I think there are two possible places where we can apply URL encoding.

The first place is after the user types (or selects) the link in the dialog.
We can only use UTF-8 there, because the target charset is unknown at that point. The
other issue is that we may have to add the URL encoding to multiple UIs where the
user can enter a URL (e.g. Image URL).

The other possibility is to change nsDocumentEncoder. It might be possible to
detect URIs in a document and then apply the URL encoding. Currently, the code does
named entity conversion first and then converts to the charset. But for a URI, it has
to be converted to the charset first and then URL-encoded, with no entity conversion
needed. Also, the current code seems to do the charset conversion per block,
not per node. Anyway, it would not be a simple change; I am going to
ask jst whether the change is possible.
Naoki--I notice in the EscapeURI method that you start out Append'ing (rather
than truncating the string to 0 length or otherwise initializing it). Is that
intentional, or do you intend that people might call this method intending to
leverage the appending (such as adding parts to the URI)?
FYI, I had the same question as brade mentioned above.

In nsHTMLContentSerializer::EscapeURI(), I think that if we execute the code in
the "if" statement, documentCharset will be pointing to a buffer that has
been free'd, since the temporary created by NS_ConvertUCS2toUTF8() will have gone
out of scope:


+  if (!uri.IsASCII()) {
+    textToSubURI = do_GetService(NS_ITEXTTOSUBURI_CONTRACTID, &rv);
+    NS_ENSURE_SUCCESS(rv, rv);
+    const PRUnichar *u;
+    mCharSet->GetUnicode(&u);
+    documentCharset = NS_ConvertUCS2toUTF8(u).get();
+  }


Also, I think that if we checked whether start >= aURI.Length() after we exit the
while loop, we could avoid executing all that "// Escape the remaining part."
code when processing URLs that end with '=', etc.

Not sure if this matters or not, but if this method gets called a lot, it may be
faster to change the code at the end of
nsHTMLContentSerializer::SerializeAttributes():

+      // Need to escape URI.
+      nsAutoString escapedURI;
+      if (NS_SUCCEEDED(EscapeURI(valueStr, escapedURI)))
+        valueStr = escapedURI;

to copy valueStr into a temp, pass the temp as the first param, and have
EscapeURI write directly into valueStr. I say this because escaped URI strings
are sometimes a lot longer than the unescaped version.
Thanks for your comments, I will update.

"Write directly into valueStr" -- I'm not doing this, to avoid creating an incomplete
URI, which might happen on errors (e.g. charset conversion failure). But I can
restore the temp in case of an error; let me try that.
Status: NEW → ASSIGNED
Kathy, Kin, please review the new patch, thanks.
r=ftang
The code:

+  const char *documentCharset;
+  mCharSet->GetUnicode(&charset);
+  documentCharset = NS_ConvertUCS2toUTF8(charset).get();

is not good: the temporary NS_ConvertUCS2toUTF8() (NS_ConvertUCS2toUTF8 is
actually a class, not a function) created here only lives long enough for the
assignment to happen, but it's not guaranteed to live any longer than that, so
this is a crash waiting to happen. To solve this, either make documentCharset an
nsXPIDLCString (which would let you move the assignment into the "if
(!uri.IsASCII())" block), or ensure that the lifetime of the temporary
NS_ConvertUCS2toUTF8 is longer than that of documentCharset.

Shouldn't this code use the *base* uri (if available) as the base to
NS_MakeAbsoluteURI()?

mCharSet really needs to be an nsCOMPtr, and you need null checks around access
to mCharSet.

Other than that the changes look good, fix that and I'll have one more look.
NS_MakeAbsoluteURI() is used under the following condition, same as the current code:
if (mFlags & nsIDocumentEncoder::OutputAbsoluteLinks)
sr=jst, I filed a new bug on the base uri stuff which I still think is wrong
(and was wrong before this change too), see bug 80081.
Checked in the patch; now I may enable 8-bit characters in the UI.
But I don't know how. Charley, can that be done by flipping a switch or
something?
Sorry, I don't have a clue! Maybe someone in XPFE group knows?
Charley--are you sure that we aren't restricting (with JS) the characters that 
can go in that field?  That might be what is interfering with what Naoki has 
done.
There is a function ValidateData(),
http://lxr.mozilla.org/seamonkey/source/editor/ui/dialogs/content/EdNamedAnchorProps.js#119
that calls ConvertToCDATAString(),
http://lxr.mozilla.org/seamonkey/source/editor/ui/dialogs/content/EdDialogCommon.js#331
329 // Replace whitespace with "_" and allow only HTML CDATA
330 //   characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", and "."
331 function ConvertToCDATAString(string)
332 {
333   return string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:]+/g,'');
334 }

But I don't know where non-ASCII characters are filtered out.

Let me ask a different question. NS6 does not restrict input for characters
greater than 127. What has changed around the anchor dialog since NS6?
The difference is that cmanske added some code to filter out characters that
are not allowed in CDATA strings (the ConvertToCDATAString() replace call you
mentioned above) as the user types.

In order to make this work for you, I think we need to make it so that we don't
filter while the user types at all (except for any CDATA requirements about what
the first char must be?), since the URI escape code should take care of non-CDATA
chars.

The next thing we need to figure out is how our DOM implementation expects the
'name' attribute of the anchor node to be stored. Escaped? Unescaped?

If it expects it to be escaped, then the modification made to the serializer
should have been unnecessary, since the escaping should have been done by the
anchor dialog code before it was stored in the anchor node.
URL escaping cannot be done while the data is in Unicode (unless we use UTF-8); it
has to be done after we know the target charset.
The rest of the change will be inside the editor, and I am not familiar with it.
Clearing the milestone and reassigning to the editor group.
The remaining part is to not filter out characters above 127 in the anchor
property dialog. I fixed the HTML serializer to escape those characters, so the UI
does not need to filter them out.
Assignee: nhotta → beppe
Status: ASSIGNED → NEW
Target Milestone: mozilla0.9.1 → ---
over to charley
Assignee: beppe → cmanske
Target Milestone: --- → mozilla0.9.2
I think this is easy to do by including the characters in the range 128 to 'end'
in the list of characters NOT to filter. What is the largest number in the range
(i.e. what is 'end')?
Status: NEW → ASSIGNED
I think we can use 65535 (0xFFFF).
Can you please test the patch to the editor's EdDialogCommon.js?
The JS reference says to use "\xnn" for a hex character, so I hope "\xFFFF" works
for values above 255.
Okay, I will apply the patch and post the result after I test.
After the patch, I was able to input Latin-1 characters (128 - 255), but characters
above 255 are still filtered out.
You can test these characters in a US environment by pasting from "Character Map".
That's too bad! That would be a deficiency in JavaScript's RegExp character
support, imho.
Brendan: How can we detect characters above 255?
Setting status. Use the last patch unless we hear about another method to match
chars > 255 in a regexp.
Keywords: patch, review
Whiteboard: FIX IN HAND need r=, sr=
I tried the last patch and it did not filter the Japanese characters I input.

But I found the data is corrupted somewhere. I put a breakpoint in the HTML content
serializer, which expects UCS2, but what I got was a zero-extended string like
0x00E3 0x0081 0x0082 for one Japanese character.

0012A9E8  23 00 E3 00 81 00 82 00 E3  #.ã...?.ã
0012A9F1  00 81 00 84 00 E3 00 81 00  ...?.ã...
0012A9FA  86 00 00 00 00 00 58 AD 12  ?.....X­.

I think what happens here is something like this: somewhere in the code, we have
a string in UTF-8 (e.g. 0xE38182); instead of converting it to UCS2, the code uses
AssignWithConversion, and the result is 0x00E3 0x0081 0x0082, just like we see in
the dump above.
Note this has nothing to do with the patch. But I didn't see this when I tested
my serializer change using existing non-ASCII anchors, so it must be somewhere
in the editor code.
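The suspected corruption is easy to reproduce outside Mozilla (a sketch with Node's Buffer and TextDecoder, standing in for the actual string classes): zero-extending each UTF-8 byte to a 16-bit unit, instead of decoding, yields exactly the 0x00E3 0x0081 0x0082 pattern in the dump above.

```javascript
const utf8 = Buffer.from('\u3042', 'utf8');   // "あ" encodes to bytes E3 81 82

// Wrong: treat each UTF-8 byte as its own UCS2 code unit (what a blind
// byte-widening AssignWithConversion-style copy would do).
const wrong = Array.from(utf8, b => String.fromCharCode(b)).join('');
console.log([...wrong].map(c => c.charCodeAt(0).toString(16)));  // [ 'e3', '81', '82' ]

// Right: decode the byte sequence as UTF-8.
const right = new TextDecoder('utf-8').decode(utf8);
console.log(right === '\u3042');              // true
```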
The problem is in nsHTMLAnchorElement.cpp, not the editor. I filed bug 81090.
Depends on: 81090
Yikes! Cc'ing rogerl. JS uses Unicode, including in its regexp implementation.
If you can show a case where [^a-zA-Z0-9_\.\-\:]+ does not match code points >=
256, please file a bug on the JS engine. Cc'ing pschwartau for help reducing
the testcase.

The workaround of looping over each character is too expensive to contemplate.
Please, let's fix this right, ASAP.

/be
Ah, I should have read my bugmail or this bug's comments before commenting. 
Thanks to nhotta, this sounds like a bug in HTML content code, treating UTF-8 as
ASCII.  But it'd be cool to confirm that JS regexp character classes (negated
ones, yet) work with two-byte code points.

cmanske, the klunky fix seems spurious in light of nhotta's finding -- I hope
you don't land it!

/be
My comment on 2001-05-15 17:23:
>Note this is nothing to do with the patch.
Just to clarify: I meant that the patch for EdDialogCommon.js has nothing to do with
the string corruption. The string corruption happens at href creation (that's bug
81090). The patch for EdDialogCommon.js is needed separately, in order to avoid
filtering the user's input for non-ASCII anchor names.
nhotta: agreed, but the klunky fix (last patch) is not needed (besides being too
slow and based on a fear of a JS regexp unicode bug that does not exist), once
81090 is fixed -- right?

/be
It's needed. Bug 81090 fixes a problem of data loss after the "Link Property"
dialog. The .js patch in this bug is to avoid the data loss in the "Anchor Property"
dialog. There are two separate data losses, so we need two different patches to
fix both problems.
Brendan: So how hard would it be to extend RegExp to handle the full range of
characters ("\xnnnn" instead of "\xnn")?
nhotta: yes, but we don't need the last patch ("Klunky fix") once 81090 is
fixed, I hope -- because JS regexps correctly handle UTF16 code points above
255.  Agreed?

cmanske: regexps already handle full Unicode.  Use \uXXXX to spell Unicode code
points in the UTF16-like encoding that JS uses.

/be
Also see bug 72964:
"String method for pattern matching failed for Chinese Simplified (GB2312)"

There is a reduced HTML testcase attached there:
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=28553

The final section of the testcase uses Unicode escape sequences
(Ex: str = '\u00BF\u00CD\u00BB\u00A7' ) and pattern-matching in JavaScript -
Brendan, do you mean the patch on 05/15/01 15:20 should work once bug 81090 is
fixed? Let me try and I will post the result later.
I tested with the patch for bug 81090 and the patch from 05/15/01 15:20.
Japanese characters are still filtered out. But it worked after I changed the
.js file to use \u0080 and \uFFFF instead of \x80 and \xFFFF.

I checked in a fix for a problem in nsHTMLAnchorElement::GetHref, removing the dependency.
No longer depends on: 81090
Totally cool! I didn't know about "\uxxxx"! So r=cmanske on that part.
Brendan: does this look good now? Can you please sr=?
Whiteboard: FIX IN HAND need r=, sr= → FIX IN HAND need sr=
nhotta: I don't understand your latest patch here -- the negated character class
excludes all characters not listed after the [^ and before the closing ], so you
don't want to add \u0080-\uFFFF -- that range is already excluded by cmanske's
patch.

/be
Blocks: 81090
No longer blocks: 81090
Depends on: 81090
05/15/01 15:20
string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g,'');
05/16/01 16:31
string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\u0080-\uFFFF]+/g,'');

I just changed \x80 to \u0080 and \xFFFF to \uFFFF.
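The likely reason the \x form failed (my reading of the regex grammar, not stated in this bug): \x takes exactly two hex digits, so "\xFFFF" parses as \xFF followed by the two literal characters "FF", and the character class only reaches up to U+00FF. The \u form takes four digits and covers the full range:

```javascript
// The 05/15 patch (two-digit \x escapes):
const cdataX = s => s.replace(/\s+/g, '_')
                     .replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g, '');
// The 05/16 patch (four-digit \u escapes):
const cdataU = s => s.replace(/\s+/g, '_')
                     .replace(/[^a-zA-Z0-9_\.\-\:\u0080-\uFFFF]+/g, '');

console.log(cdataX('\u00e0'));  // "à"  -- Latin-1 survives either way
console.log(cdataX('\u3042'));  // ""   -- Japanese stripped: the class tops out at \xFF
console.log(cdataU('\u3042'));  // "あ" -- kept with the \u form
```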
The patch works -- I tested it. The higher characters should be included in the
list of negated characters. The filter does: "replace any character that is *not*
one of the specific characters, and not a char >= 128, with nothing", thus *allowing*
the higher characters to be typed.
I checked in the 5/16 patch. So all related issues are resolved now, i.e., we
can mark this as fixed?
Whiteboard: FIX IN HAND need sr= → FIX IN HAND
yes, I think so.
fixed
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Keywords: review
Whiteboard: FIX IN HAND
Sorry, I misread the diff and added confusion. I also didn't know that chars >
127 were legal CDATA chars. That's a new one on me.

/be
changing QA contact
QA Contact: ylong → marina
Based on the 2001-05-17 build I am reopening this bug: not only are all non-ASCII
chars stripped in the anchor dialog box when highlighted, but I cannot input
them at all... I can enter ASCII chars into the anchor dialog box, but every time I
attempt to enter non-ASCII they are escaped.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
2001-05-17 does not have the fix, please verify with 5/18 build or later, thanks.
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
OK, I thought it was checked into the latest 17th build... today's build
crashes for me... I'll verify on the next one, then.
There is a regression caused by this (bug 81238): it double-escapes already-escaped
URIs.
It works using the 2001-05-24-04 build; I am able to enter non-ASCII strings for
anchors (Latin-1 and double-byte).
Status: RESOLVED → VERIFIED
Depends on: 1371010
Depends on: 1371559