Closed Bug 128181 Opened 23 years ago Closed 23 years ago

ncr between 128-159 does not work in html attribute value

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: morse, Assigned: shanjian)

Details

(Keywords: intl, Whiteboard: [ADT3])

Attachments

(8 files, 2 obsolete files)

Bring up the following HTML. You will notice that the dash (character code hex 96, decimal 150) displays fine by itself, but inside a button it becomes either a vertical bar or a question mark, depending on whether it was written as a single 8-bit character or escaped using the ampersand notation.
<html>
<body>
Here's a hyphen as a single ascii character (2D): "-"<br>
Here's same hyphen in a button: <input type="button" value=" - "><br>
<br>
Here's a dash as a single ascii character (96): "–"<br>
Here's same dash in a button: <input type="button" value=" – "><br>
<br>
Here's a dash in escaped notation (ampersand #150;): "&#150;"<br>
Here's same dash in a button: <input type="button" value=" &#150; "><br>
</body>
</html>
Note: Although the dash character in my sample code above looks identical to the hyphen character, they are really two different character codes.
Keywords: nsbeta1
Worked fine on my machine but I can verify that the problem exists on Steve's NT4 machine. madhur, would you be able to test this and see if it's a problem specific to NT4?
Build ID 2002022603 on NT4: I see the same as Stephen P. Morse.
I just tried this on Linux. This time the single 8-bit character (hex 96) displayed correctly inside the button. But the ampersand notation inside the button was not correct -- it gave the question mark, just as it did on WinNT.
Ok, I do see the problem on my Linux machine. I have no idea why the text would render differently inside a button. Tentatively guessing that this is an i18n issue... can you guys shed some light on why this renders as a question mark? (Changing OS to All)
Assignee: bryner → yokoyama
Component: HTML Form Controls → Internationalization
OS: Windows NT → All
QA Contact: madhur → ruixu
Hardware: PC → All
Here's an even more interesting test case. It generates all character codes between decimal 100 ("d") and decimal 168. For each one it shows what the character looks like normally, what it looks like inside a button when written as a single 8-bit character, and what it looks like inside a button when written in the ampersand notation. Note that there doesn't appear to be any pattern as to which characters are displayed correctly, nor is there a pattern as to which incorrect character is shown when one is not displayed correctly. Will attach screen shots.
<html>
<body>
<form>
ampersand #100; "&#100;" <input type="button" value=" d "> <input type="button" value=" &#100; "><br>
ampersand #101; "&#101;" <input type="button" value=" e "> <input type="button" value=" &#101; "><br>
ampersand #102; "&#102;" <input type="button" value=" f "> <input type="button" value=" &#102; "><br>
ampersand #103; "&#103;" <input type="button" value=" g "> <input type="button" value=" &#103; "><br>
ampersand #104; "&#104;" <input type="button" value=" h "> <input type="button" value=" &#104; "><br>
ampersand #105; "&#105;" <input type="button" value=" i "> <input type="button" value=" &#105; "><br>
ampersand #106; "&#106;" <input type="button" value=" j "> <input type="button" value=" &#106; "><br>
ampersand #107; "&#107;" <input type="button" value=" k "> <input type="button" value=" &#107; "><br>
ampersand #108; "&#108;" <input type="button" value=" l "> <input type="button" value=" &#108; "><br>
ampersand #109; "&#109;" <input type="button" value=" m "> <input type="button" value=" &#109; "><br>
ampersand #110; "&#110;" <input type="button" value=" n "> <input type="button" value=" &#110; "><br>
ampersand #111; "&#111;" <input type="button" value=" o "> <input type="button" value=" &#111; "><br>
ampersand #112; "&#112;" <input type="button" value=" p "> <input type="button" value=" &#112; "><br>
ampersand #113; "&#113;" <input type="button" value=" q "> <input type="button" value=" &#113; "><br>
ampersand #114; "&#114;" <input type="button" value=" r "> <input type="button" value=" &#114; "><br>
ampersand #115; "&#115;" <input type="button" value=" s "> <input type="button" value=" &#115; "><br>
ampersand #116; "&#116;" <input type="button" value=" t "> <input type="button" value=" &#116; "><br>
ampersand #117; "&#117;" <input type="button" value=" u "> <input type="button" value=" &#117; "><br>
ampersand #118; "&#118;" <input type="button" value=" v "> <input type="button" value=" &#118; "><br>
ampersand #119; "&#119;" <input type="button" value=" w "> <input type="button" value=" &#119; "><br>
ampersand #120; "&#120;" <input type="button" value=" x "> <input type="button" value=" &#120; "><br>
ampersand #121; "&#121;" <input type="button" value=" y "> <input type="button" value=" &#121; "><br>
ampersand #122; "&#122;" <input type="button" value=" z "> <input type="button" value=" &#122; "><br>
ampersand #123; "&#123;" <input type="button" value=" { "> <input type="button" value=" &#123; "><br>
ampersand #124; "&#124;" <input type="button" value=" | "> <input type="button" value=" &#124; "><br>
ampersand #125; "&#125;" <input type="button" value=" } "> <input type="button" value=" &#125; "><br>
ampersand #126; "&#126;" <input type="button" value=" ~ "> <input type="button" value=" &#126; "><br>
ampersand #127; "&#127;" <input type="button" value=" "> <input type="button" value=" &#127; "><br>
ampersand #128; "&#128;" <input type="button" value=" € "> <input type="button" value=" &#128; "><br>
ampersand #129; "&#129;" <input type="button" value=" � "> <input type="button" value=" &#129; "><br>
ampersand #130; "&#130;" <input type="button" value=" ‚ "> <input type="button" value=" &#130; "><br>
ampersand #131; "&#131;" <input type="button" value=" ƒ "> <input type="button" value=" &#131; "><br>
ampersand #132; "&#132;" <input type="button" value=" „ "> <input type="button" value=" &#132; "><br>
ampersand #133; "&#133;" <input type="button" value=" … "> <input type="button" value=" &#133; "><br>
ampersand #134; "&#134;" <input type="button" value=" † "> <input type="button" value=" &#134; "><br>
ampersand #135; "&#135;" <input type="button" value=" ‡ "> <input type="button" value=" &#135; "><br>
ampersand #136; "&#136;" <input type="button" value=" ˆ "> <input type="button" value=" &#136; "><br>
ampersand #137; "&#137;" <input type="button" value=" ‰ "> <input type="button" value=" &#137; "><br>
ampersand #138; "&#138;" <input type="button" value=" Š "> <input type="button" value=" &#138; "><br>
ampersand #139; "&#139;" <input type="button" value=" ‹ "> <input type="button" value=" &#139; "><br>
ampersand #140; "&#140;" <input type="button" value=" Œ "> <input type="button" value=" &#140; "><br>
ampersand #141; "&#141;" <input type="button" value=" � "> <input type="button" value=" &#141; "><br>
ampersand #142; "&#142;" <input type="button" value=" Ž "> <input type="button" value=" &#142; "><br>
ampersand #143; "&#143;" <input type="button" value=" � "> <input type="button" value=" &#143; "><br>
ampersand #144; "&#144;" <input type="button" value=" � "> <input type="button" value=" &#144; "><br>
ampersand #145; "&#145;" <input type="button" value=" ‘ "> <input type="button" value=" &#145; "><br>
ampersand #146; "&#146;" <input type="button" value=" ’ "> <input type="button" value=" &#146; "><br>
ampersand #147; "&#147;" <input type="button" value=" “ "> <input type="button" value=" &#147; "><br>
ampersand #148; "&#148;" <input type="button" value=" ” "> <input type="button" value=" &#148; "><br>
ampersand #149; "&#149;" <input type="button" value=" • "> <input type="button" value=" &#149; "><br>
ampersand #150; "&#150;" <input type="button" value=" – "> <input type="button" value=" &#150; "><br>
ampersand #151; "&#151;" <input type="button" value=" — "> <input type="button" value=" &#151; "><br>
ampersand #152; "&#152;" <input type="button" value=" ˜ "> <input type="button" value=" &#152; "><br>
ampersand #153; "&#153;" <input type="button" value=" ˜ "> <input type="button" value=" &#153; "><br>
ampersand #154; "&#154;" <input type="button" value=" š "> <input type="button" value=" &#154; "><br>
ampersand #155; "&#155;" <input type="button" value=" › "> <input type="button" value=" &#155; "><br>
ampersand #156; "&#156;" <input type="button" value=" œ "> <input type="button" value=" &#156; "><br>
ampersand #157; "&#157;" <input type="button" value=" � "> <input type="button" value=" &#157; "><br>
ampersand #158; "&#158;" <input type="button" value=" ž "> <input type="button" value=" &#158; "><br>
ampersand #159; "&#159;" <input type="button" value=" Ÿ "> <input type="button" value=" &#159; "><br>
ampersand #160; "&#160;" <input type="button" value="   "> <input type="button" value=" &#160; "><br>
ampersand #161; "&#161;" <input type="button" value=" ¡ "> <input type="button" value=" &#161; "><br>
ampersand #162; "&#162;" <input type="button" value=" ¢ "> <input type="button" value=" &#162; "><br>
ampersand #163; "&#163;" <input type="button" value=" £ "> <input type="button" value=" &#163; "><br>
ampersand #164; "&#164;" <input type="button" value=" ¤ "> <input type="button" value=" &#164; "><br>
ampersand #165; "&#165;" <input type="button" value=" ¥ "> <input type="button" value=" &#165; "><br>
ampersand #166; "&#166;" <input type="button" value=" ¦ "> <input type="button" value=" &#166; "><br>
ampersand #167; "&#167;" <input type="button" value=" § "> <input type="button" value=" &#167; "><br>
ampersand #168; "&#168;" <input type="button" value=" ¨ "> <input type="button" value=" &#168; "><br>
</form>
</body>
</html>
Steve, the HTML testcase would be easier to work with if you could attach it as an HTML attachment, instead of pasting it in a comment.
Keywords: intl
QA Contact: ruixu → ylong
ftang: this is bad, but I need your help on this.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
Status: ASSIGNED → NEW
Keywords: nsbeta1 → nsbeta1+
&#128; &#133; &#150; &#142; &#149; &#158; have this issue. These are invalid NCRs, which refer to the range 0x80 - 0x9f. nsbeta1+
ftang: I am a little confused. For example, why is #137 invalid? I can see '0/00' in plain text, but not in a button. http://bugzilla.mozilla.org/attachment.cgi?id=72340&action=view
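The range mentioned above can be written down as a one-line check. This is only a minimal sketch: the function name `IsC1RangeNcr` is hypothetical and does not come from the Mozilla source. It simply tests whether a numeric character reference value lies in 0x80 - 0x9f (decimal 128 - 159), which Unicode and ISO-8859-1 reserve for C1 control characters.

```cpp
#include <cstdint>

// Hypothetical helper (not from the Mozilla tree): returns true when the
// value of a numeric character reference falls in 0x80-0x9F, i.e. decimal
// 128-159, the range described above as invalid.
constexpr bool IsC1RangeNcr(std::uint32_t value) {
    return value >= 0x80 && value <= 0x9F;
}
```

Of the values in the 100 to 168 testcase, only 128 through 159 satisfy this check; 160 (no-break space) and above are ordinary Latin-1 characters.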
adt3
Priority: -- → P3
Whiteboard: adt3
I think this is one issue we may consider cutting. Giving it to shanjian, since this seems related to text display in XUL buttons.
Assignee: yokoyama → shanjian
The testcase revealed 2 problems. 1) We have 2 different interpretations of NCRs, which is incorrect. It also reveals that we have a problem with NCR surrogate support, because surrogate support is only available in one place. 2) We assume all characters in the current code page, from 0 to 255, are available in a raster font. That's incorrect: different raster fonts vary in the glyph sets they cover. In almost all raster fonts for ANSI, 127, 128 to 144, and 147 to 159 are not available. I need information from various localized Windows versions to confirm this.
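To make point 2 concrete, here is a minimal sketch of a per-character coverage check, instead of assuming that every code from 0 to 255 has a glyph. The function names are hypothetical, and the missing ranges are taken directly from the comment above; as noted there, they are still unconfirmed for localized Windows versions.

```cpp
#include <cstdint>

// Codes the comment above reports as missing from almost all ANSI raster
// fonts: 127, 128-144 and 147-159. These ranges are an unconfirmed
// observation from this bug, not a property of every font.
constexpr bool LikelyMissingFromAnsiRasterFont(std::uint8_t code) {
    return code == 127 ||
           (code >= 128 && code <= 144) ||
           (code >= 147 && code <= 159);
}

// Hypothetical check a renderer could make before drawing with a raster
// font; when it returns false, the caller would fall back to another font.
constexpr bool RasterFontCanRender(std::uint8_t code) {
    return !LikelyMissingFromAnsiRasterFont(code);
}
```

Under these assumptions the dash from the original report (150) is not covered, which is consistent with it being drawn as a different glyph inside the button.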
Status: NEW → ASSIGNED
I think we should split these two problems: file one bug for the raster font issue and another bug for the NCR issue. I think the raster font problem is more important than the NCR issue.
Bug 134733 is filed for the raster font cmap problem. This bug will be dedicated to the NCR problem.
Summary: 8-bit characters don't display properly in buttons → we should have only one NCR parsing path
Attachment #76657 - Attachment is obsolete: true
shanjian- if we don't take the fix for this bug, what may break? Under what conditions will we hit the missing part?
shanjian- I need an impact analysis: in what cases will the newly added ncr_hack handling be needed?
shanjian- I cannot judge whether this issue is important or not without detailed information. Please tell me what will break without your fix.
shanjian said that, without his patch, the problem shows up when the NCR is in an attribute value.
shanjian- could you make the patch smaller by:
1. not moving the following part:
+#define PA_REMAP_128_TO_160_ILLEGAL_NCR 1
+
+#ifdef PA_REMAP_128_TO_160_ILLEGAL_NCR
+/**
+ * Map some illegal but commonly used numeric entities into their
+ * appropriate unicode value.
+ */
+#define NOT_USED 0xfffd
+
+static PRUint16 PA_HackTable[] = {
+ 0x20ac, /* EURO SIGN */
+ NOT_USED,
+ 0x201a, /* SINGLE LOW-9 QUOTATION MARK */
+ 0x0192, /* LATIN SMALL LETTER F WITH HOOK */
+ 0x201e, /* DOUBLE LOW-9 QUOTATION MARK */
+ 0x2026, /* HORIZONTAL ELLIPSIS */
+ 0x2020, /* DAGGER */
+ 0x2021, /* DOUBLE DAGGER */
+ 0x02c6, /* MODIFIER LETTER CIRCUMFLEX ACCENT */
+ 0x2030, /* PER MILLE SIGN */
+ 0x0160, /* LATIN CAPITAL LETTER S WITH CARON */
+ 0x2039, /* SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
+ 0x0152, /* LATIN CAPITAL LIGATURE OE */
+ NOT_USED,
+ 0x017D, /* LATIN CAPITAL LETTER Z WITH CARON */
+ NOT_USED,
+ NOT_USED,
+ 0x2018, /* LEFT SINGLE QUOTATION MARK */
+ 0x2019, /* RIGHT SINGLE QUOTATION MARK */
+ 0x201c, /* LEFT DOUBLE QUOTATION MARK */
+ 0x201d, /* RIGHT DOUBLE QUOTATION MARK */
+ 0x2022, /* BULLET */
+ 0x2013, /* EN DASH */
+ 0x2014, /* EM DASH */
+ 0x02dc, /* SMALL TILDE */
+ 0x2122, /* TRADE MARK SIGN */
+ 0x0161, /* LATIN SMALL LETTER S WITH CARON */
+ 0x203a, /* SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
+ 0x0153, /* LATIN SMALL LIGATURE OE */
+ NOT_USED,
+ 0x017E, /* LATIN SMALL LETTER Z WITH CARON */
+ 0x0178 /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
+};
+#endif /* PA_REMAP_128_TO_160_ILLEGAL_NCR */
+
+#define H_SURROGATE(s) ((PRUnichar)(((PRUint32)s - (PRUint32)0x10000) >> 10) + (PRUnichar)0xd800)
+#define L_SURROGATE(s) ((PRUnichar)(((PRUint32)s - (PRUint32)0x10000) & 0x3ff) + (PRUnichar)0xdc00)
+#define IS_IN_BMP(ucs4) ((ucs4) < 0x10000)
+
2. only declaring, but not implementing, your static void AppendNCR(nsString& aString, PRInt32 aNCRValue) around line 1549
3. implementing your static void AppendNCR(nsString& aString, PRInt32 aNCRValue) after line 2168
In this way, you can probably reduce the size of the patch to about 10 lines of changes.
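For readers who want to see how the quoted table and surrogate macros fit together, the following is a minimal standalone sketch, not the actual AppendNCR from the patch (its body is not shown in this bug). It uses `std::u16string` in place of `nsString`, and the name `AppendNcrSketch` is hypothetical. The table values are copied from PA_HackTable above; mapping lone surrogates and values above 0x10FFFF to U+FFFD is an assumption of this sketch.

```cpp
#include <cstdint>
#include <string>

namespace {

constexpr char16_t kNotUsed = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Same 32 values as PA_HackTable in the quoted patch, indexed by
// (NCR value - 0x80). They correspond to Windows-1252 bytes 0x80-0x9F.
constexpr char16_t kRemap128To159[32] = {
    0x20AC, kNotUsed, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030,   0x0160, 0x2039, 0x0152, kNotUsed, 0x017D, kNotUsed,
    kNotUsed, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122,   0x0161, 0x203A, 0x0153, kNotUsed, 0x017E, 0x0178,
};

}  // namespace

// Appends the UTF-16 code unit(s) for one numeric character reference.
void AppendNcrSketch(std::u16string& out, std::uint32_t value) {
    if (value >= 0x80 && value <= 0x9F) {
        // Illegal but commonly used NCR: remap through the table.
        out.push_back(kRemap128To159[value - 0x80]);
    } else if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF)) {
        // Assumption of this sketch: out-of-range values and lone
        // surrogates become U+FFFD.
        out.push_back(kNotUsed);
    } else if (value < 0x10000) {
        // Inside the BMP: a single code unit (the IS_IN_BMP case).
        out.push_back(static_cast<char16_t>(value));
    } else {
        // Outside the BMP: split into a surrogate pair, as the
        // H_SURROGATE / L_SURROGATE macros do.
        const std::uint32_t offset = value - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
}
```

With this sketch, `&#150;` yields U+2013 (en dash) and `&#137;` yields U+2030 (per mille sign), while a value such as 0x1F600 is emitted as the pair 0xD83D 0xDE00.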
This is before the change. shanjian, please get sr= and attach an image taken after the patch.
Comment on attachment 77710 the same patch, but there is no code relocation as suggested by ftang. r=ftang, makes sense. Please ask for sr=
Attachment #77710 - Flags: review+
This looks like another low-risk, mid-impact fix to me. It will certainly make our surrogate support better, and it makes sure that when users use those hacky NCRs in an attribute value (such as in a form or alt text), they behave the same as when the NCR is in the content.
Attachment #77123 - Attachment is obsolete: true
jst, could you sr=?
Comment on attachment 77710 the same patch, but there is no code relocation as suggested by ftang. sr=jst
Attachment #77710 - Flags: superreview+
Low-risk fix. The real impact is unknown, probably not very high. But judging from the screenshot, I think we should consider this for adt1.0.0.
Keywords: adt1.0.0
Changing the summary to "ncr between 128-159 does not work in html attribute value"
Summary: we should have only one NCR parsing path → ncr between 128-159 does not work in html attribute value
Whiteboard: adt3 → [ADT3]
adt1.0.0- Ftang will document the reason.
Keywords: adt1.0.0 → adt1.0.0-
We decided not to take this one, for the following reasons: 1. These NCRs are not legal HTML NCRs, and people usually won't use them. 2. We have had this behavior for several years already, and we have not received any complaints about it from external customers. Therefore, it is not worth taking right now. Please check it into the trunk after the m1.0 branch-off.
Comment on attachment 77710 the same patch, but there is no code relocation as suggested by ftang. a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #77710 - Flags: approval+
fix checked in to trunk. Close this bug.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Um, Sorry... but why was this checked in to the trunk now? Will it make it to 1.0? According to comment #37 from ftang it will not because it is non-standard and nobody has really complained... so... why should we accept non-standard behavior into the trunk, *especially* if we are not going to have this behavior in our biggest, and most important release (1.0)? Accepting this fix in the trunk seems a bit strange -- it only encourages bad behavior in webpages, at the cost of our standards compliance. Answering Roy's question in comment 14, &#137; is invalid because it is not in the HTML specification. For example, HTML defines the per mille symbol as: <!ENTITY permil CDATA "&#8240;" -- per mille sign, U+2030 ISOtech --> and not &#137; see http://www.w3.org/TR/html4/sgml/entities.html for more definitions. We should be looking to remove non-standard behaviour and not add it. Please back this patch out and mark this bug wontfix.
Simply saying it is non-standard is not enough reason for me to back out this change. In real life we have to strike a balance between standards and other factors, such as compatibility with old browsers. As long as it does not violate the standard, we have to be a little bit forgiving. Besides, we don't want to keep 2 code paths that give users different behavior depending on whether text appears in content or in an attribute value. If it is decided that this is really evil, we should back them all out.
> Simply saying it is non-standard is not enough reason for me to back out this change.
It was enough reason for ftang and ADT to reject this fix for 1.0. If this is an important issue for webmasters, then this fix should also go into the 1.0 branch, because that is the important release, not the trunk.
> In real life we have to strike a balance between standards and other factors, such as compatibility with old browsers.
The fix is _NOT_ compatible with old browsers. Netscape 4.77 does not handle this testcase very well either: most items display "?" for me, and only a few of them display anything at all. We add more illegal characters than Netscape 4 displayed. At worst we should support only what NS 4.7 does. We should not add any extra violations.
> As long as it does not violate the standard, we have to be a little bit forgiving.
But it _does_ violate the standard...
> Besides, we don't want to keep 2 code paths that give users different behavior
> depending on whether text appears in content or in an attribute value. If it is
> decided that this is really evil, we should back them all out.
I agree. It is bad that different things happen in attributes and in regular text nodes. Should we consider discontinuing support for it in all places? It seems that hardly anyone uses these characters: if our behavior was "broken" for several years, as ftang says (comment #37), and nobody complained, then it is probably safe to assume that support for these can be removed...
We have had this "feature" for HTML text for quite some time. The hack has been there since 9/22/1998. We just didn't have it for attribute values. Personally, I don't have a strong opinion for either approach, i.e. keeping or removing the hack. As long as it is acceptable to users and makes our product better, I will do it. I would suggest you file a separate bug about this issue and have the discussion there. In this bug, I just want to use a single code path for all NCR parsing. Besides the NCR hack, I added surrogate support as well.
Fix verified on the 04-11 trunk build / Linux RH7.2. Since we are not going to check this into the branch (per comment #37), I'll mark this one as verified.
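The "single code path" idea can be illustrated with a small sketch in which one function expands decimal references of the form `&#NNN;`, and both the text-content caller and the attribute-value caller would go through it. The function name `ExpandDecimalNcrs` is hypothetical and this is not the Mozilla parser: it handles only decimal references, passes every other byte through unchanged (so it assumes ASCII or Latin-1 input), and leaves any remapping of 128 - 159 to a later step.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shared routine: expands "&#NNN;" references in `text` and
// returns the resulting code points. Malformed references are kept as
// literal characters.
std::vector<std::uint32_t> ExpandDecimalNcrs(const std::string& text) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text.compare(i, 2, "&#") == 0) {
            std::size_t j = i + 2;
            std::uint32_t value = 0;
            // Stop accumulating once the value is past the Unicode range,
            // which also keeps the arithmetic from overflowing.
            while (j < text.size() && text[j] >= '0' && text[j] <= '9' &&
                   value <= 0x10FFFF) {
                value = value * 10 + static_cast<std::uint32_t>(text[j] - '0');
                ++j;
            }
            const bool hasDigits = j > i + 2;
            if (hasDigits && j < text.size() && text[j] == ';') {
                out.push_back(value);
                i = j + 1;
                continue;
            }
        }
        // Not a well-formed reference: copy the byte through unchanged.
        out.push_back(static_cast<unsigned char>(text[i]));
        ++i;
    }
    return out;
}
```

Because the button label `" &#150; "` and the same string in body text would both pass through this one function, they could no longer diverge the way the original testcase showed.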
Status: RESOLVED → VERIFIED