Closed Bug 58819 Opened 24 years ago Closed 24 years ago

Non-Latin 1 anchors don't work

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking

()

VERIFIED FIXED

People

(Reporter: dbaron, Assigned: nhottanscp)

References

Details

(Keywords: intl, Whiteboard: [rtm-] [r=erik])

Attachments

(9 files)

DESCRIPTION: Named anchors with cyrillic text aren't working. I suspect the problem has something to do with URL escaping -- perhaps we need to unescape the URL when we go to the anchor? We should also be careful to check the spec for correct behavior. This was pointed out on bug 56285 by Eugene Savitsky. STEPS TO REPRODUCE (from bug 56285): 1. Load http://www.fcenter.ru/wn/hn02082000.htm 2. Select the second link (the one below 3dfx..., which works fine) ACTUAL RESULTS Nothing happens, only the URLbar updates to http://www.fcenter.ru/wn/hn02082000.htm#%E1%E5%EB%E0%FF EXPECTED RESULTS: * page scrolls down DOES NOT WORK CORRECTLY ON: * Linux, mozilla, opt build of 2000-11-01 tree closure
I can reproduce this when I set character coding to windows-1251. But it works if character coding is ISO-8859-1. It might be that charset conversion (document charset to unicode) is missing after url escape. That could be the reason it works for only ISO-8859-1. But I have no idea where the escaping is happening for anchors, I mean what source file, I need help. Cc to i18n people, this might be more generic issue (e.g. Japaense anchor name).
As far as I can tell, this is a generic issue. Currently, no anchors are working unless it is interpresed as ISO-8859-1. Japanese anchors don't work, either. I changed the summary to reflect the general nature of the problem. This is a big problem and it will break many web pages.
Summary: problems going to cyrillic anchors → Non-Latin 1 anchors don't work
Nominating for rtm.
Keywords: rtm
I'm trying to quantify what percentage of web sites are using non-ASCII anchors. My initial rough estimate is 20% of all non-ASCII sites and 10% of all web sites. Please correct me with solid info if anyone has it.
CC'ing a bunch of people whose names appear on the other bug.
Bobj/Msanz - Whatare your thoughts on this bug? Would you consider it a stop-ship bug? Remember, the CORE team is at a point of RC1, right now.
In the code, Latin1 is assumed(line 3753). We should convert the enescaped string from document charset to unicode. nsDocShell.cpp nsDocShell::ScrollIfAnchor 3748 char *str = sNewRef.ToNewCString(); 3749 3750 // nsUnescape modifies the string that is passed into it. 3751 nsUnescape(str); 3752 3753 sNewRef.AssignWithConversion(str); 3754 3755 nsMemory::Free(str); 3756 3757 rv = shell->GoToAnchor(sNewRef);
I don't know how to get a document charset in docShell. Cata, do you know anything about that?
Bob, can you mark this [rtm need info]?
We believe non-latin1 anchors are widely used and this would break many pages.
Whiteboard: [rtm need info]
Attached patch patchSplinter Review
The patch works with the Russian page. Is there Japanese data that I can test?
OS: Linux → All
Hardware: PC → All
There is a description in the HTML spec. B.2.1 Non-ASCII characters in URI attribute values http://www.w3.org/TR/html4/appendix/notes.html#idx-anchor
Thanks for adding the UTF-8 conversion code. There are many places in that file where they check mContentViewer before using it. Please do the same in your patch.
Added [r=erik].
Whiteboard: [rtm need info] → [rtm need info] [r=erik]
Status: NEW → ASSIGNED
Keywords: 4xp
The patch #3 convers the anchors in the same document. There are other cases also not working for non ASCII names. Those are the url bar, open web location and referenced from othter documents. nsHTMLContentSink.cpp assumes the string as Latin1 by using AssignWithConversion for the input url string (line 3514). 3507 rv = mDocumentURI->QueryInterface(NS_GET_IID(nsIURL), (void**)&url); 3508 if (NS_SUCCEEDED(rv)) { 3509 rv = url->GetRef(&ref); 3510 NS_RELEASE(url); 3511 } 3512 if (rv == NS_OK) { 3513 mRef = new nsString; 3514 mRef->AssignWithConversion(ref); 3515 nsCRT::free(ref); 3516 }
*** Bug 58980 has been marked as a duplicate of this bug. ***
I quickly glanced at the patch and noticed that there's "new nsString" in there without a null check after the new call, that needs to be fixed. Also, do we need two string members for the ref in the content sink, couldn't mRef be replaced with mRefInDocCharset?
sr=waterson for patch#3.
Johnny, I tried to minimized changes for the existing code (to make ASCII case unchanged). I think mRefInDocCharset can replace mRef but I don't do that change now. I am going to add the null check for "new nsString".
PDT marking [rtm-] because this bug doesn't crash or lose data, and we're out of time to shake out a big patch.
Whiteboard: [rtm need info] [r=erik] → [rtm-] [r=erik]
Target Milestone: --- → M22
Are we fixing this for the trunk?
Erik, please review patch #6 & #7. Patch #6, almost same as patch #4, I just added an empty string change. Patch #7, this is for the case reference outside the document. Basically doing the same things as nsDocShell.cpp patch (try UTF-8 first then document charset).
Netscape Keyword List page at Netcenter (at the above URL) or from the drop down list of a Japanese NS6 build uses Japanese anchors and they are failing. (Added Lynn W.)
Yes. I reported this some time ago to Chau in Bugsplat. I am still waiting for her to fix the page.
I talked to erik about patch #6. Since the additional change assumes Latin1 as an input for special case, it is not following the spec. So we decided to use patch #4 (got r and sr already). About patch #7, for external anchor, I will make sure this is consistent with related bug fix (bug 10373) before I get reviews and check in.
Please remove the redundant mRef... member from the sink before checking in if you're using patch #4...
>So we decided to use patch #4 (got r and sr already). My mistake, I meant patch #3. Patch #4 will not be used (I rewrote it as patch #7).
Checked in patch #3 (fix for internal document case) to the trunk. I tested patch #7 (fix for external anchor) with Japanese path name (used test data of bug 10373), like this. <a href="http://mugen/url/ÆüËܸì/ja_anchors.html#´Á»ú">ÆüËܸì</a> It opens the correct file and scrolls to the anchor correctly.
Blocks: 60075
What is the problem here? http://www.alion.ru/tech/slovar_cdrw.shtml The page is on russian, but the anchors are latin. When pressing "? ??????????" it does not takes tou to the top...
In that page, the name is "#top", <h6><a name="#top">[</a> It works after I changed href to "##top", <a href="##top">? ??????????</a> The spec say NAME must begin with a letter ([A-Za-z]) http://www.w3.org/TR/html4/types.html#type-name But the link work with 4.x. I am not sure we want to support this in mozilla.
Sorry, I mentioned a wrong place in the spec. http://www.w3.org/TR/html4/struct/links.html#adef-name-A
I have removed the live Netcenter test case -- to my chagrin, they fixed the problem by changing Japanese anchors to ASCII ones.
r=jst for patch #8
Fixed in the trunk.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
I verified this in 2000-12-20-10 Mac Mtrunk, 2000-12-19-04 Win32 Mtrunk, and 2000-12-19-08 Linux Mtrunk builds.
Status: RESOLVED → VERIFIED
Keywords: intl, nsbeta1
*** Bug 52844 has been marked as a duplicate of this bug. ***
No longer blocks: 60075
*** Bug 62933 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: