Closed Bug 58819 Opened 24 years ago Closed 24 years ago

Non-Latin 1 anchors don't work

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking

()

VERIFIED FIXED

People

(Reporter: dbaron, Assigned: nhottanscp)

References

Details

(Keywords: intl, Whiteboard: [rtm-] [r=erik])

Attachments

(9 files)

DESCRIPTION:  Named anchors with cyrillic text aren't working.  I suspect the
problem has something to do with URL escaping -- perhaps we need to unescape the
URL when we go to the anchor?  We should also be careful to check the spec for
correct behavior.

This was pointed out on bug 56285 by Eugene Savitsky.

STEPS TO REPRODUCE (from bug 56285):

1. Load http://www.fcenter.ru/wn/hn02082000.htm
2. Select the second link (the one below 3dfx..., which works fine)

ACTUAL RESULTS
Nothing happens, only the URLbar updates to
http://www.fcenter.ru/wn/hn02082000.htm#%E1%E5%EB%E0%FF

EXPECTED RESULTS:
 * page scrolls down

DOES NOT WORK CORRECTLY ON:
 * Linux, mozilla, opt build of 2000-11-01 tree closure
I can reproduce this when I set character coding to windows-1251.
But it works if character coding is ISO-8859-1.
It might be that charset conversion (document charset to unicode) is missing
after url escape. That could be the reason it works for only ISO-8859-1.

But I have no idea where the escaping is happening for anchors, I mean what
source file, I need help.
Cc to i18n people, this might be more generic issue (e.g. Japaense anchor name).
As far as I can tell, this is a generic issue.
Currently, no anchors are working unless it is interpresed
as ISO-8859-1. Japanese anchors don't work, either.
I changed the summary to reflect the general nature of the
problem. This is a big problem and it will break many
web pages.
Summary: problems going to cyrillic anchors → Non-Latin 1 anchors don't work
Nominating for rtm.
Keywords: rtm
I'm trying to quantify what percentage of web sites are using
non-ASCII anchors. My initial rough estimate is 20% of all non-ASCII
sites and 10% of all web sites. Please correct me with solid info
if anyone has it.
CC'ing a bunch of people whose names appear on the other bug.
Bobj/Msanz - Whatare your thoughts on this bug? Would you consider it a 
stop-ship bug? Remember, the CORE team is at a point of RC1, right now.
In the code, Latin1 is assumed(line 3753).
We should convert the enescaped string from document charset to unicode.

nsDocShell.cpp
nsDocShell::ScrollIfAnchor

3748             char *str = sNewRef.ToNewCString();
3749
3750             // nsUnescape modifies the string that is passed into it.
3751             nsUnescape(str);
3752
3753             sNewRef.AssignWithConversion(str);
3754
3755             nsMemory::Free(str);
3756
3757             rv = shell->GoToAnchor(sNewRef);
I don't know how to get a document charset in docShell.
Cata, do you know anything about that?
Bob, can you mark this [rtm need info]?
We believe non-latin1 anchors are widely used and this would break many pages.
Whiteboard: [rtm need info]
Attached patch patchSplinter Review
The patch works with the Russian page.
Is there Japanese data that I can test?
OS: Linux → All
Hardware: PC → All
There is a description in the HTML spec.

B.2.1 Non-ASCII characters in URI attribute values

http://www.w3.org/TR/html4/appendix/notes.html#idx-anchor
Thanks for adding the UTF-8 conversion code.

There are many places in that file where they check mContentViewer before using
it. Please do the same in your patch.
Added [r=erik].
Whiteboard: [rtm need info] → [rtm need info] [r=erik]
Status: NEW → ASSIGNED
Keywords: 4xp
The patch #3 convers the anchors in the same document.
There are other cases also not working for non ASCII names. Those are the url 
bar, open web location and referenced from othter documents.
nsHTMLContentSink.cpp assumes the string as Latin1 by using AssignWithConversion 
for the input url string (line 3514).

3507   rv = mDocumentURI->QueryInterface(NS_GET_IID(nsIURL), (void**)&url);
3508   if (NS_SUCCEEDED(rv)) {
3509     rv = url->GetRef(&ref);
3510     NS_RELEASE(url);
3511   }
3512   if (rv == NS_OK) {
3513     mRef = new nsString;
3514     mRef->AssignWithConversion(ref);
3515     nsCRT::free(ref);
3516   }
*** Bug 58980 has been marked as a duplicate of this bug. ***
I quickly glanced at the patch and noticed that there's "new nsString" in there
without a null check after the new call, that needs to be fixed. Also, do we
need two string members for the ref in the content sink, couldn't mRef be
replaced with mRefInDocCharset?
sr=waterson for patch#3.
Johnny, I tried to minimized changes for the existing code (to make ASCII case 
unchanged). I think mRefInDocCharset can replace mRef but I don't do that change 
now. I am going to add the null check for "new nsString".
PDT marking [rtm-] because this bug doesn't crash or lose data, and we're out of
time to shake out a big patch.
Whiteboard: [rtm need info] [r=erik] → [rtm-] [r=erik]
Target Milestone: --- → M22
Are we fixing this for the trunk?
Erik, please review patch #6 & #7.
Patch #6, almost same as patch #4, I just added an empty string change.
Patch #7, this is for the case reference outside the document. Basically doing
the same things as nsDocShell.cpp patch (try UTF-8 first then document charset).
Netscape Keyword List page at Netcenter (at the above URL) or
from the drop down list of a Japanese NS6 build uses Japanese
anchors and they are failing. (Added Lynn W.)
Yes. I reported this some time ago to Chau in Bugsplat. I am still waiting for 
her to fix the page.
I talked to erik about patch #6. Since the additional change assumes Latin1 as
an input for special case, it is not following the spec. So we decided to use
patch #4 (got r and sr already).
About patch #7, for external anchor, I will make sure this is consistent with
related bug fix (bug 10373) before I get reviews and check in.
Please remove the redundant mRef... member from the sink before checking in if
you're using patch #4...
>So we decided to use patch #4 (got r and sr already).
My mistake, I meant patch #3.
Patch #4 will not be used (I rewrote it as patch #7).
Checked in patch #3 (fix for internal document case) to the trunk.

I tested patch #7 (fix for external anchor) with Japanese path name (used test
data of bug 10373), like this.
<a href="http://mugen/url/ÆüËܸì/ja_anchors.html#´Á»ú">ÆüËܸì</a>

It opens the correct file and scrolls to the anchor correctly.


Blocks: 60075
What is the problem here? http://www.alion.ru/tech/slovar_cdrw.shtml

The page is on russian, but the anchors are latin. When pressing "? ??????????"
it does not takes tou to the top...
 
In that page, the name is "#top",
<h6><a name="#top">[</a>
It works after I changed href to "##top",
<a href="##top">? ??????????</a>

The spec say NAME must begin with a letter ([A-Za-z])
http://www.w3.org/TR/html4/types.html#type-name

But the link work with 4.x. I am not sure we want to support this in mozilla.

Sorry, I mentioned a wrong place in the spec.
http://www.w3.org/TR/html4/struct/links.html#adef-name-A


I have removed the live Netcenter test case -- to my chagrin,
they fixed the problem by changing Japanese anchors to
ASCII ones.
r=jst for patch #8
Fixed in the trunk.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
I verified this in 2000-12-20-10 Mac Mtrunk, 2000-12-19-04 Win32 Mtrunk, and 2000-12-19-08 Linux Mtrunk builds. 
Status: RESOLVED → VERIFIED
Keywords: intl, nsbeta1
*** Bug 52844 has been marked as a duplicate of this bug. ***
No longer blocks: 60075
*** Bug 62933 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: