Closed Bug 97054 Opened 23 years ago Closed 23 years ago

UTF-16 character coding support

Categories

(Core :: Internationalization, defect, P4)

Tracking

()

VERIFIED WORKSFORME
mozilla0.9.9

People

(Reporter: ilya.konstantinov+future, Assigned: shanjian)

Details

(Keywords: intl, meta)

Attachments

(8 files)

As many applications today can use UTF-16 for editing text files, I think we
should add support for this character set (both in LE and BE forms).

Also, it's necessary to autodetect UCS-2 encoded files, which can be done by
checking for the initial bytes FF-FE (for LE order) or FE-FF (for BE order).
Apparently, that's what Internet Explorer does (keep in mind it's not possible
to read the <META> tags which specify the encoding *inside* the HTML before the
encoding is known, and not many authors would know how to change the HTTP
server's headers).
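A minimal sketch of the BOM check described above, in Python rather than Mozilla's own code. The function names and the ISO-8859-1 fallback are assumptions made for illustration only; they are not taken from Mozilla or Internet Explorer.

```python
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the byte order implied by a leading UTF-16 BOM, or None."""
    if data[:2] == b"\xff\xfe":   # FF-FE: little-endian
        return "utf-16-le"
    if data[:2] == b"\xfe\xff":   # FE-FF: big-endian
        return "utf-16-be"
    return None


def decode_with_bom(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode using the BOM if present, else a (hypothetical) fallback charset."""
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        return data.decode(fallback)
    # Skip the two BOM bytes so U+FEFF does not end up in the text.
    return data[2:].decode(encoding)
```

The check only needs the first two bytes, so it can run before any <META> tag is parsed.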
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: intl
QA Contact: andreasb → ylong
assigning to bstell
Assignee: yokoyama → bstell
should this be assigned to Shanjian ?
Sorry Brian.  I would like to assign this to ftang since it is a 
new feature and we need to put this on our development schedule.

== assigning to ftang and changing to All platform/All OS.
Assignee: bstell → ftang
OS: Linux → All
Hardware: PC → All
mark it as m0.9.7
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.7
p4
Priority: -- → P4
give this bug to shanjian to drive the feature. I believe this is a meta bug; we
need to identify the other real bugs that must be solved to support UTF-16.
shanjian- mid priority project. 
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Keywords: meta
Summary: UCS-2 character coding support → UTF-16 character coding support
I think bug 42893 has some implication for this bug. (hmm, 
bugzilla may need 'related-to' relation in addition to 'blocks' and
'depends on' relations).
Last night, I stumbled upon a UTF-16LE encoded web page
(it was Hanja - Chinese character- dictionary in Korea).
They wrote that their pages are in Unicode and I assumed
that they're in UTF-8, but Mozilla can't render it while
MS IE can. It was not until I saved the source html file and examined 
it that I realized that it's in UTF-16LE with BOM. I was about
to write to the webmaster of the site that using UTF-16 is a
violation of HTML and (s)he has to convert pages to UTF-8. Before
actually writing that, I thought just in case I might as well check
the standard and it turned out that UTF-16 is a valid MIME charset
for html.  That's how I found this bug along with bug 42893.

 As jbetak wrote in his comment to bug 42893 and I've just confirmed
myself, it's trivial to add UTF-16LE and UTF-16BE to the view|character encoding
menu and to make them work with actual UTF-16LE/UTF-16BE encoded
web pages (no change in actual code but just a few changes in *properties files)
because the necessary infrastructure is already in place with
the possible exception of automatic detection of endianness.

  Would it be a bad idea to turn on UTF-16LE/UTF-16BE *now*
(because it comes almost free and some web sites are actually
encoded in UTF-16) and
to work on the automatic detection of endianness (with BOM)
and perhaps various transformation formats of Unicode (so that
Mozilla can have Unicode(Auto) for automatic detection
of UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) *later*? 
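One way the proposed Unicode(Auto) detection could look, sketched in Python under the assumption that every input carries a BOM. The table order and the function name are illustrative, not anything Mozilla implements. The one subtlety is ordering: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signatures have to be tried first.

```python
import codecs
from typing import Optional, Tuple

# Longest BOMs first, so UTF-32LE is not mistaken for UTF-16LE.
_BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_unicode_bom(data: bytes) -> Optional[Tuple[str, int]]:
    """Return (encoding, BOM length) for a leading Unicode BOM, or None."""
    for bom, name in _BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

A UTF-16LE document whose first character after the BOM is U+0000 would be reported as UTF-32LE by this sketch; that is unlikely for HTML, but a real detector would have to decide how to treat it.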
Jungshik Shin,
supporting UTF-16 without surrogate support should be an easy thing to do. Can you 
post a website which is using UTF-16 encoding? 
Status: NEW → ASSIGNED
Yes, it's trivial to enable UTF-16 without surrogate support (I've
turned it on in my build). 
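To illustrate what "without surrogate support" leaves out, here is a small Python sketch (the helper names are hypothetical). A character in the Basic Multilingual Plane is one 16-bit code unit in UTF-16, while a character above U+FFFF is encoded as a pair of surrogate code units. A decoder that treats every unit as one character handles the first case and not the second.

```python
import struct
from typing import List


def utf16le_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text when encoded as UTF-16LE."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit lies in the surrogate range D800-DFFF."""
    return 0xD800 <= unit <= 0xDFFF


bmp_units = utf16le_code_units("\u6f22")         # a BMP Han character
astral_units = utf16le_code_units("\U00020000")  # a supplementary-plane character

print([hex(u) for u in bmp_units])     # one code unit, not a surrogate
print([hex(u) for u in astral_units])  # two code units, both surrogates
```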
As for web pages using UTF-16LE I mentioned, the site seems to be down
at the moment. UTF-16LE part (Chinese Character Dictionary) is 
somewhere under http://ikc.korea.ac.kr/~cnsc, but I can't give you 
the exact URL because the site is down. 
If you just wanna test UTF-16 support on a simple page, I put up
two test pages in UTF-16LE and UTF-16BE at 
http://jshin.net/moztest/css2.utf16le.html and
http://jshin.net/moztest/css2.utf16be.html
 
The web pages I found in UTF-16LE are at
http://ikc.korea.ac.kr/~cnsc/hidb/intro.htm
In the top frame, you may select the middle menu
(which is radical+stroke count index). Then,
in the middle frame, you'll find a list of
radicals. Click on any of radicals and you'll
get the page encoded in UTF-16LE in the bottom
frame. If you click on any of Chinese characters
in the bottom frame, the right frame will show
you some information about the character (Korean
pronunciation, Unicode/ISO 10646 code point,
KS X, GB, CNS, JIS, VN code points, etc) 

 
There doesn't seem to be much need for using any
form of Unicode because they're using 96x96(??)
GIF images to represent Chinese characters
other than radicals. Well, there must be some radicals
not representable in legacy encodings. 
I was just about to enter this bug for Mozilla 0.9.5.  It is very important
that UTF-16 is supported, as UTF-16 is a part of the XML 1.0 specification and
must be supported.  Also, Unicode must be supported as a part of the HTML 4.01
specification, but there is no mention of the actual encoding scheme that needs
to be supported or the default encoding scheme; at least I couldn't find it...

I tried Netscape 6.1, and it exhibits the same behavior under Windows, but
somehow works under Mac OS.

This issue should be a show stopper for version 1.0.  2-byte content can get
very large under UTF-8, so UTF-16 is desperately needed...
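A rough illustration of the size argument, as a Python sketch with a made-up helper name. Characters from U+0800 to U+FFFF (which covers most CJK text) take three bytes in UTF-8 and two in UTF-16, while ASCII markup takes one byte in UTF-8 and two in UTF-16, so the actual saving depends on how much of a page is markup.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


han = "\u6f22\u5b57"   # two Han characters
markup = "<html>"      # six ASCII characters

print(encoded_sizes(han))     # UTF-8 is larger for CJK text
print(encoded_sizes(markup))  # UTF-16 is larger for ASCII markup
```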
Please move this as BLOCKER.

I would like to run tests for UTF-16.  I am stopped from using these tests.  

I found a bug under Netscape 6.1 on the Macintosh, where somehow the UTF-16 is 
turned on in the build.  I tried UTF-16 for XML, HTML, CSS-2, and text files.  
It all seems to work well.  However, it seems that CSS files in UTF-16 are 
ignored, but UTF-8 ones work.  I cannot test this CSS "feature" in Mozilla 
because UTF-16 is not available.
move it to 0.9.8, but I will try to resolve it in 0.9.7. 
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Thanks.  Just note that all testing for me is BLOCKED. :'-(
I cannot proceed.  I have pending bugs that I cannot verify.
Jungshik, Joaquin, 
I need to have a test case to go on. As far as I can recall, html parser does 
detect BOM. Even though UTF-16LE, UTF-16BE is not available in charset menu, we 
do support them. The reason for not putting them in the charset menu is that we don't 
need/want user's intervention here. Since BOM is mandatory for UTF-16XX, we can 
always identify them when they are used. Using 
"http://ikc.korea.ac.kr/~cnsc/hidb/under.htm" as an example, UTF-16LE is 
identified and correctly marked in charset menu. 

Joaquin seems to be suggesting that CSS in UTF-16XX encoding is ignored. Is UTF-16XX 
allowed for encoding CSS files? If so, could you attach your test case here? We 
probably do not detect UTF-16XX in some of the parsers or parser paths. If that 
is the case, I will fix it. 
Could you also please provide UTF-16 testcases for XML, please?
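For anyone wanting to produce such testcases without a Unicode-aware editor, a minimal Python sketch follows. The function name and the <test> element are invented for illustration and are not the contents of the attachments on this bug.

```python
import codecs
from xml.sax.saxutils import escape


def make_utf16_xml(text: str, little_endian: bool = True) -> bytes:
    """Build a small XML document encoded as UTF-16 with a leading BOM."""
    bom = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    codec = "utf-16-le" if little_endian else "utf-16-be"
    # The <test> element is a placeholder; escape() protects &, < and >.
    doc = '<?xml version="1.0" encoding="UTF-16"?>\n<test>%s</test>\n' % escape(text)
    return bom + doc.encode(codec)


# Example: write one little-endian and one big-endian testcase.
if __name__ == "__main__":
    with open("test.utf16le.xml", "wb") as f:
        f.write(make_utf16_xml("\u6f22\u5b57", little_endian=True))
    with open("test.utf16be.xml", "wb") as f:
        f.write(make_utf16_xml("\u6f22\u5b57", little_endian=False))
```\n")
assert be.decode("utf-16").endswith("<test>a &amp; b</test>\n")
</test>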
This requires the needed CSS file.  It can be opened in Word2k, Notepad under
WinNT/2k, or other program that supports Unicode UTF16.
Other XML and HTML might depend on this.  It can be opened in Word2k, Notepad
under WinNT/2k, or other program that supports Unicode UTF16.
This can test whether CSS works smoothly with a UTF-16 document
Some test cases depend on this document.
Target Milestone: mozilla0.9.8 → mozilla0.9.9
I tried all 6 testcases with a recent trunk build on Windows, and everything
works perfectly. So could someone tell me what the problem is? 
Works for me on Linux too (Mozilla 0.9.8).
Resolve it as WFM.
Reopen the bug if somebody still experiences the problem. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
Verified it as works for me.
Status: RESOLVED → VERIFIED