Closed Bug 97054 Opened 23 years ago Closed 23 years ago

UTF-16 character coding support

Categories

(Core :: Internationalization, defect, P4)

Tracking

VERIFIED WORKSFORME
mozilla0.9.9

People

(Reporter: ilya.konstantinov+future, Assigned: shanjian)

Details

(Keywords: intl, meta)

Attachments

(8 files)

As many applications today can use UTF-16 for editing text files, I think we should add support for this character set (in both LE and BE forms). Also, it's necessary to autodetect UCS-2 encoded files; the byte order can be derived from the initial FF-FE (for LE order) or FE-FF (for BE order). Apparently, that's what Internet Explorer does (keep in mind that the <META> tags which specify the encoding sit *inside* the HTML and so can't be read before the encoding is known, and not many authors would know how to change the HTTP server's headers).
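A minimal sketch of the byte-order check described in the report, written in Python for illustration only. It is not Mozilla's parser code; the function names `sniff_utf16_bom` and `decode_with_bom` and the ISO-8859-1 fallback are assumptions made for this example.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


def decode_with_bom(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode data as UTF-16 when a BOM is present, otherwise with the fallback charset.

    The fallback charset is a hypothetical default chosen for this sketch.
    """
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        return data.decode(fallback)
    # Skip the two BOM bytes so they do not show up as U+FEFF in the text.
    return data[2:].decode(encoding)
```

For instance, `decode_with_bom(codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le"))` yields the string `<html>`, while input without a BOM is left to the fallback charset.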
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: intl
QA Contact: andreasb → ylong
assigning to bstell
Assignee: yokoyama → bstell
should this be assigned to Shanjian ?
Sorry Brian. I would like to assign this to ftang since it is a new feature and we need to put this on our development schedule. Assigning to ftang and changing to All platforms/All OS.
Assignee: bstell → ftang
OS: Linux → All
Hardware: PC → All
mark it as m0.9.7
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.7
p4
Priority: -- → P4
Giving this bug to shanjian to drive the feature. I believe this is a meta bug; we need to identify the other real bugs that must be solved to support UTF-16. shanjian: mid-priority project.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Keywords: meta
Summary: UCS-2 character coding support → UTF-16 character coding support
I think bug 42893 has some implications for this bug (hmm, Bugzilla may need a 'related-to' relation in addition to the 'blocks' and 'depends on' relations). Last night, I stumbled upon a UTF-16LE encoded web page (it was a Hanja, i.e. Chinese character, dictionary in Korea). They wrote that their pages are in Unicode and I assumed that they were in UTF-8, but Mozilla can't render them while MS IE can. It was not until I saved the source HTML file and examined it that I realized that it's in UTF-16LE with a BOM. I was about to write to the webmaster of the site that using UTF-16 is a violation of HTML and that (s)he has to convert the pages to UTF-8. Before actually writing that, I thought I might as well check the standard just in case, and it turned out that UTF-16 is a valid MIME charset for HTML. That's how I found this bug along with bug 42893. As jbetak wrote in his comment to bug 42893 and I've just confirmed myself, it's trivial to add UTF-16LE and UTF-16BE to the view|character encoding menu and to make them work with actual UTF-16LE/UTF-16BE encoded web pages (no change in actual code, just a few changes in *properties files) because the necessary infrastructure is already in place, with the possible exception of automatic detection of endianness. Would it be a bad idea to turn on UTF-16LE/UTF-16BE *now* (because it comes almost free and some web sites are actually encoded in UTF-16) and to work on the automatic detection of endianness (with BOM) and perhaps of the various transformation formats of Unicode (so that Mozilla can have Unicode(Auto) for automatic detection of UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) *later*?
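One way the proposed Unicode(Auto) detection could be sketched, again in Python and purely as an illustration rather than anything in the Mozilla tree. The name `detect_unicode_bom` and the table `_BOMS` are hypothetical. The point the sketch tries to show is ordering: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signatures have to be tried first.

```python
import codecs

# Longest signatures first, because the UTF-32LE BOM starts with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_unicode_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caveat of this ordering: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE. That is rare in real pages, but it means a BOM check alone is a heuristic rather than a proof.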
Jungshik Shin, supporting UTF-16 without surrogate support should be an easy thing to do. Can you post a website that uses UTF-16 encoding?
Status: NEW → ASSIGNED
Yes, it's trivial to enable UTF-16 without surrogate support (I've turned it on in my build). As for the web pages using UTF-16LE that I mentioned, the site seems to be down at the moment. The UTF-16LE part (Chinese Character Dictionary) is somewhere under http://ikc.korea.ac.kr/~cnsc, but I can't give you the exact URL because the site is down. If you just wanna test UTF-16 support on a simple page, I put up two test pages in UTF-16LE and UTF-16BE at http://jshin.net/moztest/css2.utf16le.html and http://jshin.net/moztest/css2.utf16be.html
The web pages I found in UTF-16LE are at http://ikc.korea.ac.kr/~cnsc/hidb/intro.htm . In the top frame, you may select the middle menu (which is the radical+stroke count index). Then, in the middle frame, you'll find a list of radicals. Click on any of the radicals and you'll get the page encoded in UTF-16LE in the bottom frame. If you click on any of the Chinese characters in the bottom frame, the right frame will show you some information about the character (Korean pronunciation, Unicode/ISO 10646 code point, KS X, GB, CNS, JIS, VN code points, etc.). There doesn't seem to be much need for using any form of Unicode because they're using 96x96(??) GIF images to represent Chinese characters other than radicals. Well, there must be some radicals not representable in legacy encodings.
I was just about to enter this bug for Mozilla 0.9.5. It is very important that UTF-16 is supported, as UTF-16 is a part of the XML 1.0 specification and must be supported. Also, Unicode must be supported as a part of the HTML 4.01 specification, but there is no mention of the actual encoding scheme that needs to be supported or the default encoding scheme; at least I couldn't find it... I tried Netscape 6.1, and it exhibits the same behavior under Windows, but somehow works under Mac OS. This issue should be a show stopper for version 1.0. 2-byte content can get very large under UTF-8, so UTF-16 is desperately needed...
Please mark this as BLOCKER. I would like to run tests for UTF-16, and I am stopped from running them. I found a bug under Netscape 6.1 on the Macintosh, where UTF-16 is somehow turned on in the build. I tried UTF-16 for XML, HTML, CSS-2, and text files. It all seems to work well. However, it seems that CSS files in UTF-16 are ignored, while UTF-8 ones work. I cannot test this CSS "feature" in Mozilla because UTF-16 is not available.
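The size argument can be checked with a small Python sketch. It is an illustration only; the helper `encoded_sizes` and both sample strings are made up for this example. Characters from U+0800 to U+FFFF, which covers most CJK text, take three bytes in UTF-8 but two in UTF-16, while ASCII goes the other way.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length of text in bytes for UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # explicit byte order, so no BOM is added
    }


# Four CJK characters from the Basic Multilingual Plane (hypothetical sample text).
cjk = "\u6f22\u5b57\u8f9e\u5178"
# Four ASCII characters for comparison.
ascii_text = "html"

print(encoded_sizes(cjk))         # {'utf-8': 12, 'utf-16': 8}
print(encoded_sizes(ascii_text))  # {'utf-8': 4, 'utf-16': 8}
```

So the saving depends on the content: markup-heavy pages with mostly ASCII tags can come out larger in UTF-16, while pages dominated by CJK text come out smaller.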
move it to 0.9.8, but I will try to resolve it in 0.9.7.
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Thanks. Just note that all testing for me is BLOCKED. :'-( I cannot proceed. I have pending bugs that I cannot verify.
Jungshik, Joaquin, I need to have a test case to go on. As far as I can recall, the HTML parser does detect the BOM. Even though UTF-16LE and UTF-16BE are not available in the charset menu, we do support them. The reason for not putting them in the charset menu is that we don't need/want the user's intervention here. Since BOM is mandatory for UTF-16XX, we can always identify them when they are. Using "http://ikc.korea.ac.kr/~cnsc/hidb/under.htm" as an example, UTF-16LE is identified and correctly marked in the charset menu. Joaquin seems to be suggesting that CSS in a UTF-16XX encoding is ignored. Is UTF-16XX allowed for encoding CSS files? If so, could you attach your test case here? We probably do not detect UTF-16XX in some of the parsers or parser paths. If that is the case, I will fix it.
Could you also provide UTF-16 testcases for XML, please?
This requires the needed CSS file. It can be opened in Word2k, Notepad under WinNT/2k, or other program that supports Unicode UTF16.
Other XML and HTML might depend on this. It can be opened in Word2k, Notepad under WinNT/2k, or other program that supports Unicode UTF16.
This can test whether CSS works smoothly with a UTF-16 document.
Some test cases depend on this document.
Target Milestone: mozilla0.9.8 → mozilla0.9.9
I tried all 6 testcases with a recent trunk build on Windows, and everything works perfectly. So could someone tell me what the problem is?
Works for me on Linux too (Mozilla 0.9.8).
Resolving as WFM. Reopen the bug if somebody still experiences the problem.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
Verified it as works for me.
Status: RESOLVED → VERIFIED