UTF-16 character coding support

VERIFIED WORKSFORME

Status

P4
normal
VERIFIED WORKSFORME
18 years ago
17 years ago

People

(Reporter: ilya.konstantinov+future, Assigned: shanjian)

Tracking

Trunk
mozilla0.9.9
intl, meta
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(8 attachments)

(Reporter)

Description

18 years ago
As many applications today can use UTF-16 for editing text files, I think we
should add support for this character set (both in LE and BE forms).

Also, it's necessary to autodetect UCS-2 encoded files, which can be derived
from the initial FF-FE (for LE order) or FE-FF (for BE order). Apparently,
that's what Internet Explorer does (keep in mind it's not possible to get the
<META> tags which specify the encoding *inside* the HTML, and not many authors
would know how to change the HTTP server's headers).
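
A minimal sketch of the byte-order-mark check described above, in Python. The function name `sniff_utf16_bom` is hypothetical and this is not Mozilla's or Internet Explorer's actual code; it only illustrates that FF FE at the start marks little-endian order and FE FF marks big-endian order.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return a codec name if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None  # no BOM: fall back to HTTP headers, <META>, or heuristics


# Hypothetical page content: a little-endian BOM followed by UTF-16LE text.
page = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
print(sniff_utf16_bom(page))  # utf-16-le
```

Because the check looks only at the first two bytes, it can run before any markup is parsed, which is why it sidesteps the problem of reading a <META> tag whose encoding is not yet known.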

Updated

18 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: intl
QA Contact: andreasb → ylong

Comment 1

18 years ago
assigning to bstell
Assignee: yokoyama → bstell

Comment 2

18 years ago
should this be assigned to Shanjian ?

Comment 3

18 years ago
Sorry Brian.  I would like to assign this to ftang since it is a 
new feature and we need to put this on our development schedule.

== assigning to ftang and changing to All platform/All OS.
Assignee: bstell → ftang
OS: Linux → All
Hardware: PC → All

Comment 4

18 years ago
mark it as m0.9.7
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.7

Comment 5

18 years ago
p4
Priority: -- → P4

Comment 6

17 years ago
Giving this bug to shanjian to drive the feature. I believe this is a meta bug; we
need to identify the other real bugs to fix in order to support UTF-16. shanjian: mid
priority project.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Keywords: meta
Summary: UCS-2 character coding support → UTF-16 character coding support

Comment 7

17 years ago
I think bug 42893 has some implication for this bug. (hmm, 
bugzilla may need 'related-to' relation in addition to 'blocks' and
'depends on' relations).
Last night, I stumbled upon a UTF-16LE encoded web page
(it was a Hanja, i.e. Chinese character, dictionary in Korea).
They wrote that their pages are in Unicode and I assumed
that they're in UTF-8, but Mozilla can't render them while
MS IE can. It was not until I saved the source html file and examined
it that I realized that it's in UTF-16LE with BOM. I was about
to write to the webmaster of the site that using UTF-16 is a
violation of HTML and (s)he has to convert pages to UTF-8. Before
actually writing that, I thought I might as well check
the standard just in case, and it turned out that UTF-16 is a valid MIME charset
for html.  That's how I found this bug along with bug 42893.

 As jbetak wrote in his comment to bug 42893 and I've just confirmed
myself, it's trivial to add UTF-16LE and UTF-16BE to the view|character encoding
menu and to make them work with actual UTF-16LE/UTF-16BE encoded
web pages (no change in actual code, just a few changes in *properties files)
because the necessary infrastructure is already in place, with
the possible exception of automatic detection of endianness.

  Would it be a bad idea to turn on UTF-16LE/UTF-16BE *now*
(because it comes almost free and some web sites are actually
encoded in UTF-16) and
to work on the automatic detection of endianness (with BOM)
and perhaps various transformation formats of Unicode (so that
Mozilla can have Unicode(Auto) for automatic detection
of UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) *later*? 
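
The "Unicode(Auto)" idea above could be sketched roughly as follows, in Python. The names `detect_unicode_bom` and `decode_with_bom` are hypothetical and the UTF-8 fallback is an assumption of this sketch, not anything Mozilla does. The one subtle point is ordering: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer marks have to be tested first.

```python
import codecs

# Longest BOMs first, so that UTF-32LE is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_unicode_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Strip the BOM if there is one and decode with the matching codec."""
    name, skip = detect_unicode_bom(data)
    return data[skip:].decode(name or fallback)
```

One known limitation of this ordering: a UTF-16LE document whose first character after the BOM is U+0000 would be read as UTF-32LE. That is unlikely for HTML or XML, but it means the sketch is a heuristic rather than a proof.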
(Assignee)

Comment 8

17 years ago
Jungshik Shin,
Supporting UTF-16 without surrogate support should be an easy thing to do. Can you
post a website which is using UTF-16 encoding?
Status: NEW → ASSIGNED
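
For context on what "surrogate support" adds, here is a minimal Python sketch of decoding big-endian UTF-16 code units, including the step that joins a high and a low surrogate into one code point above U+FFFF. The function name `decode_utf16_be` is hypothetical and this is not the Mozilla converter. Without the pairing branch, every 16-bit unit maps directly to one character, which is the easy case mentioned above.

```python
import struct


def decode_utf16_be(data: bytes) -> str:
    """Decode big-endian UTF-16 (no BOM), pairing surrogates by hand."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    units = struct.unpack(f">{len(data) // 2}H", data)
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Surrogate pair: 10 bits from each unit, offset by 0x10000.
            low = units[i + 1]
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        else:
            # BMP character (an unpaired surrogate is passed through as-is).
            out.append(chr(u))
            i += 1
    return "".join(out)


# U+20000 is encoded in UTF-16 as the pair D840 DC00.
print(hex(ord(decode_utf16_be(b"\xd8\x40\xdc\x00"))))  # 0x20000
```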

Comment 9

17 years ago
Yes, it's trivial to enable UTF-16 without surrogate support (I've
turned it on in my build). 
As for web pages using UTF-16LE I mentioned, the site seems to be down
at the moment. UTF-16LE part (Chinese Character Dictionary) is 
somewhere under http://ikc.korea.ac.kr/~cnsc, but I can't give you 
the exact URL because the site is down. 
If you just wanna test UTF-16 support on a simple page, I put up
two test pages in UTF-16LE and UTF-16BE at 
http://jshin.net/moztest/css2.utf16le.html and
http://jshin.net/moztest/css2.utf16be.html
 

Comment 10

17 years ago
The web pages I found in UTF-16LE are at
http://ikc.korea.ac.kr/~cnsc/hidb/intro.htm
In the top frame, you may select the middle menu
(which is radical+stroke count index). Then,
in the middle frame, you'll find a list of
radicals. Click on any of radicals and you'll
get the page encoded in UTF-16LE in the bottom
frame. If you click on any of Chinese characters
in the bottom frame, the right frame will show
you some information about the character (Korean
pronunciation, Unicode/ISO 10646 code point,
KS X, GB, CNS, JIS, VN code points, etc) 

 
There doesn't seem to be much need for using any
form of Unicode because they're using 96x96(??)
GIF images to represent Chinese characters
other than radicals. Well, there must be some radicals
not representable in legacy encodings. 

Comment 11

17 years ago
I was just about to enter this bug for Mozilla 0.9.5.  It is very important
that UTF-16 is supported, as UTF-16 is a part of the XML 1.0 specification and
must be supported.  Also, Unicode must be supported as a part of the HTML 4.01
specification, but there is no mention of the actual encoding scheme that needs
to be supported or the default encoding scheme; at least I couldn't find it...

I tried Netscape 6.1, and it exhibits the same behavior under Windows, but
somehow works under Mac OS.

This issue should be a show stopper for version 1.0.  2-byte content can get
very large under UTF-8, so UTF-16 is desperately needed...
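
To put rough numbers on the size argument, a small Python sketch follows. The sample strings and the helper name `encoded_sizes` are made up for illustration. CJK characters in the Basic Multilingual Plane take 3 bytes each in UTF-8 but 2 in UTF-16, while ASCII markup goes the other way (1 byte versus 2).

```python
def encoded_sizes(text: str):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


cjk = "漢字" * 1000        # 2000 CJK characters
markup = "<p></p>" * 1000  # 7000 ASCII characters

print(encoded_sizes(cjk))     # (6000, 4000)
print(encoded_sizes(markup))  # (7000, 14000)
```

So the saving depends on the ratio of CJK text to ASCII markup in the page; a tag-heavy page can end up larger in UTF-16 than in UTF-8.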

Comment 12

17 years ago
Please move this as BLOCKER.

I would like to run tests for UTF-16.  I am stopped from using these tests.  

I found a bug under Netscape 6.1 on the Macintosh, where somehow the UTF-16 is 
turned on in the build.  I tried UTF-16 for XML, HTML, CSS-2, and text files.  
It all seems to work well.  However, it seems that CSS files in UTF-16 are 
ignored, but UTF-8 ones work.  I cannot test this CSS "feature" in Mozilla 
because UTF-16 is not available.
(Assignee)

Comment 13

17 years ago
move it to 0.9.8, but I will try to resolve it in 0.9.7. 
Target Milestone: mozilla0.9.7 → mozilla0.9.8

Comment 14

17 years ago
Thanks.  Just note that all testing for me is BLOCKED. :'-(
I cannot proceed.  I have pending bugs that I cannot verify.
(Assignee)

Comment 15

17 years ago
Jungshik, Joaquin, 
I need to have a test case to go on. As far as I can recall, the html parser does
detect the BOM. Even though UTF-16LE and UTF-16BE are not available in the charset
menu, we do support them. The reason for not putting them in the charset menu is
that we don't need/want the user's intervention here. Since a BOM is mandatory for
UTF-16XX, we can always identify them when they occur. Using
"http://ikc.korea.ac.kr/~cnsc/hidb/under.htm" as an example, UTF-16LE is
identified and correctly marked in the charset menu.

Joaquin seems to be suggesting that css in UTF-16XX encoding is ignored. Is
UTF-16XX allowed for encoding CSS files? If so, could you attach your test case
here? We probably do not detect UTF-16XX in some of the parsers or parser paths.
If that is the case, I will fix it.
Could you also please provide UTF-16 testcases for XML?

Comment 17

17 years ago
Created attachment 59597 [details]
UTF-16 XML document, no CSS

Comment 18

17 years ago
Created attachment 59598 [details]
UTF-16 XML referencing UTF-16 CSS document

This requires the needed CSS file.  It can be opened in Word2k, Notepad under
WinNT/2k, or other program that supports Unicode UTF16.

Comment 19

17 years ago
Created attachment 59599 [details]
UTF-16 Cascading Style Sheet

Other XML and HTML might depend on this.  It can be opened in Word2k, Notepad
under WinNT/2k, or other program that supports Unicode UTF16.

Comment 20

17 years ago
Created attachment 59600 [details]
UTF-16 XML referencing UTF-8 CSS document

This can test whether CSS works smoothly with a UTF-16 document.

Comment 21

17 years ago
Created attachment 59601 [details]
UTF-8 Cascading Style Sheet

Some test cases depend on this document.

Comment 22

17 years ago
Created attachment 59602 [details]
UTF-16 HTML document, no CSS

Comment 23

17 years ago
Created attachment 59604 [details]
UTF-16 HTML referencing UTF-16 CSS document

Comment 24

17 years ago
Created attachment 59605 [details]
UTF-16 HTML referencing UTF-8 CSS document
(Assignee)

Updated

17 years ago
Target Milestone: mozilla0.9.8 → mozilla0.9.9
(Assignee)

Comment 25

17 years ago
I tried all 6 testcases with a recent trunk build on windows, and everything
works perfectly. So could someone tell me what the problem is?
(Reporter)

Comment 26

17 years ago
Works for me on Linux too (Mozilla 0.9.8).
(Assignee)

Comment 27

17 years ago
Resolving it as WFM.
Reopen the bug if somebody still experiences the problem.
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → WORKSFORME

Comment 28

17 years ago
Verified it as works for me.
Status: RESOLVED → VERIFIED