Closed Bug 97054 Opened 24 years ago Closed 24 years ago

UTF-16 character coding support

Tracking

Status:

VERIFIED WORKSFORME

Milestone:

mozilla0.9.9

People

(Reporter: ilya.konstantinov+future, Assigned: shanjian)

Details

(Keywords: intl, meta)

Attachments

(8 files)

UTF-16 XML document, no CSS 24 years ago Joaquin Menchaca 1.11 KB, text/plain		Details
UTF-16 XML referencing UTF-16 CSS document 24 years ago Joaquin Menchaca 1.45 KB, text/plain		Details
UTF-16 Cascading Style Sheet 24 years ago Joaquin Menchaca 1.94 KB, text/plain		Details
UTF-16 XML referencing UTF-8 CSS document 24 years ago Joaquin Menchaca 1.45 KB, text/plain		Details
UTF-8 Cascading Style Sheet 24 years ago Joaquin Menchaca 1013 bytes, text/plain		Details
UTF-16 HTML document, no CSS 24 years ago Joaquin Menchaca 2.05 KB, text/plain		Details
UTF-16 HTML referencing UTF-16 CSS document 24 years ago Joaquin Menchaca 2.33 KB, text/plain		Details
UTF-16 HTML referencing UTF-8 CSS document 24 years ago Joaquin Menchaca 2.33 KB, text/plain		Details

Ilya Konstantinov

Reporter

Description

•

24 years ago

As many applications today can use UTF-16 for editing text files, I think we should add support for this character set (in both LE and BE forms). It's also necessary to autodetect UCS-2 encoded files, which can be done from the initial byte order mark: FF-FE (for LE order) or FE-FF (for BE order). Apparently, that's what Internet Explorer does (keep in mind that it's not possible to read the <META> tags which specify the encoding *inside* the HTML before the encoding is known, and not many authors would know how to change the HTTP server's headers).
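The byte order mark check described above can be sketched roughly as follows. This is only an illustration in Python, not Mozilla's parser code; the function names `sniff_utf16_bom` and `decode_with_bom` and the `iso-8859-1` fallback are assumptions made for the example.

```python
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 byte order implied by a leading BOM, or None."""
    if data[:2] == b"\xff\xfe":   # FF FE: little-endian
        return "utf-16-le"
    if data[:2] == b"\xfe\xff":   # FE FF: big-endian
        return "utf-16-be"
    return None


def decode_with_bom(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-16 when a BOM is present, else use the (assumed) fallback."""
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        return data.decode(fallback)
    return data[2:].decode(encoding)  # skip the 2-byte BOM itself
```

Because the check looks only at the first two bytes, it can run before any <META> tag is parsed, which is the point made in the description.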

Andreas Becker

Updated

•

24 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Keywords: intl

QA Contact: andreasb → ylong

Roy Yokoyama

Comment 1

•

24 years ago

assigning to bstell

Assignee: yokoyama → bstell

kill this account

Comment 2

•

24 years ago

should this be assigned to Shanjian ?

Roy Yokoyama

Comment 3

•

24 years ago

Sorry Brian. I would like to assign this to ftang since it is a new feature and we need to put this on our development schedule. == assigning to ftang and changing to All platform/All OS.

Assignee: bstell → ftang

OS: Linux → All

Hardware: PC → All

Frank Tang

Comment 4

•

24 years ago

mark it as m0.9.7

Status: NEW → ASSIGNED

Target Milestone: --- → mozilla0.9.7

Frank Tang

Comment 5

•

24 years ago

Priority: -- → P4

Frank Tang

Comment 6

•

24 years ago

Give this bug to shanjian to drive the feature. I believe this is a meta bug; we need to identify the other real bugs that must be solved to support UTF-16. shanjian: mid-priority project.

Assignee: ftang → shanjian

Status: ASSIGNED → NEW

Keywords: meta

Summary: UCS-2 character coding support → UTF-16 character coding support

Jungshik Shin

Comment 7

•

24 years ago

I think bug 42893 has some implications for this bug. (Hmm, Bugzilla may need a 'related-to' relation in addition to the 'blocks' and 'depends on' relations.)

Last night, I stumbled upon a UTF-16LE encoded web page (it was a Hanja, i.e. Chinese character, dictionary in Korea). They wrote that their pages are in Unicode and I assumed that they were in UTF-8, but Mozilla can't render it while MS IE can. It was not until I saved the source HTML file and examined it that I realized that it's in UTF-16LE with a BOM. I was about to write to the webmaster of the site that using UTF-16 is a violation of HTML and (s)he has to convert the pages to UTF-8. Before actually writing that, I thought I might as well check the standard just in case, and it turned out that UTF-16 is a valid MIME charset for HTML. That's how I found this bug along with bug 42893.

As jbetak wrote in his comment to bug 42893 and I've just confirmed myself, it's trivial to add UTF-16LE and UTF-16BE to the View | Character Encoding menu and to make them work with actual UTF-16LE/UTF-16BE encoded web pages (no change in actual code, just a few changes in *properties files) because the necessary infrastructure is already in place, with the possible exception of automatic detection of endianness. Would it be a bad idea to turn on UTF-16LE/UTF-16BE *now* (because it comes almost free and some web sites are actually encoded in UTF-16) and to work on the automatic detection of endianness (with BOM) and perhaps of the various transformation formats of Unicode (so that Mozilla can have Unicode(Auto) for automatic detection of UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) *later*?
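A "Unicode(Auto)" style detector for the five forms listed in this comment could look roughly like the sketch below. It is an illustration only and does not reflect any Mozilla code; `detect_unicode_bom` is a hypothetical name. The one subtle point is ordering: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer marks have to be tried first.

```python
import codecs
from typing import Optional, Tuple

# Longest BOMs first: the UTF-32LE mark starts with the UTF-16LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_unicode_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One known ambiguity of this approach: a UTF-16LE document whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by BOM alone, so a real detector would likely need an additional heuristic.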

Shanjian Li

Assignee

Comment 8

•

24 years ago

Jungshik Shin, supporting UTF-16 without surrogate support should be an easy thing to do. Can you post a website that uses UTF-16 encoding?
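For context on what "surrogate support" adds: without it, each 16-bit unit is treated as one character (effectively UCS-2); with it, a high surrogate followed by a low surrogate is combined into a single code point above U+FFFF. The sketch below shows that combination step. It is a generic illustration of the UTF-16 rule, not Mozilla's converter, and `decode_utf16_units` is a hypothetical name.

```python
from typing import List


def decode_utf16_units(units: List[int]) -> str:
    """Decode 16-bit code units into a string, combining surrogate pairs."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Pair: 10 bits from each half, offset by 0x10000.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append("\ufffd")  # unpaired surrogate: replacement character
            i += 1
        else:
            out.append(chr(u))    # ordinary BMP character
            i += 1
    return "".join(out)
```

For example, the units D834 DD1E combine into U+1D11E, while a decoder without surrogate support would see two separate, invalid characters.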

Status: NEW → ASSIGNED

Jungshik Shin

Comment 9

•

24 years ago

Yes, it's trivial to enable UTF-16 without surrogate support (I've turned it on in my build). As for the web pages using UTF-16LE I mentioned, the site seems to be down at the moment. The UTF-16LE part (Chinese Character Dictionary) is somewhere under http://ikc.korea.ac.kr/~cnsc, but I can't give you the exact URL because the site is down. If you just wanna test UTF-16 support on a simple page, I put up two test pages in UTF-16LE and UTF-16BE at http://jshin.net/moztest/css2.utf16le.html and http://jshin.net/moztest/css2.utf16be.html

Jungshik Shin

Comment 10

•

24 years ago

The web pages I found in UTF-16LE are at http://ikc.korea.ac.kr/~cnsc/hidb/intro.htm. In the top frame, you may select the middle menu (which is the radical + stroke count index). Then, in the middle frame, you'll find a list of radicals. Click on any of the radicals and you'll get the page encoded in UTF-16LE in the bottom frame. If you click on any of the Chinese characters in the bottom frame, the right frame will show you some information about the character (Korean pronunciation, Unicode/ISO 10646 code point, KS X, GB, CNS, JIS, VN code points, etc.). There doesn't seem to be much need for using any form of Unicode because they're using 96x96(??) GIF images to represent Chinese characters other than radicals. Well, there must be some radicals not representable in legacy encodings.

Joaquin Menchaca

•

24 years ago

Jungshik, Joaquin, I need a test case to go on. As far as I can recall, the HTML parser does detect the BOM. Even though UTF-16LE and UTF-16BE are not available in the charset menu, we do support them. The reason for not putting them in the charset menu is that we don't need or want the user's intervention here. Since the BOM is mandatory for UTF-16XX, we can always identify such documents when we encounter them. Using "http://ikc.korea.ac.kr/~cnsc/hidb/under.htm" as an example, UTF-16LE is identified and correctly marked in the charset menu. Joaquin seems to be suggesting that CSS in a UTF-16XX encoding is ignored. Is UTF-16XX allowed for encoding CSS files? If so, could you attach your test case here? We probably do not detect UTF-16XX in some of the parsers or parser paths. If that is the case, I will fix it.

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 16

•

24 years ago

Could you also provide UTF-16 testcases for XML, please?

Joaquin Menchaca

•

24 years ago

Attached file UTF-16 HTML document, no CSS — Details

Joaquin Menchaca

•

24 years ago

Resolve it as WFM. Reopen the bug if somebody still experiences the problem.

Status: ASSIGNED → RESOLVED

Closed: 24 years ago

Resolution: --- → WORKSFORME

Yuying Long

Comment 28

•

24 years ago

Verified it as works for me.

Status: RESOLVED → VERIFIED
