Need Unicode Normalization process...

VERIFIED DUPLICATE of bug 8275

Status

Core
Internationalization
P3
normal
19 years ago
15 years ago

People

(Reporter: Frank Tang, Assigned: Shanjian Li)

Tracking

Trunk
helpwanted

(Reporter)

Description

19 years ago
Details unknown. Needed after Beta 1.
(Reporter)

Updated

19 years ago
Status: NEW → ASSIGNED
Target Milestone: M15
(Reporter)

Comment 1

19 years ago
Post Beta 1; marking as M15. This should probably be implemented inside unicharutil.
(Reporter)

Updated

19 years ago
Whiteboard: Help Wanted
(Reporter)

Updated

19 years ago
Target Milestone: M15 → M20
(Reporter)

Comment 2

19 years ago
Changing it to M20.

Updated

18 years ago
Keywords: helpwanted
Whiteboard: Help Wanted

Comment 3

18 years ago
Won't we need this for things like searching and comparison of Unicode data?
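
To illustrate the concern, here is a minimal sketch of why comparison needs normalization: the same visible text can be stored precomposed or as base letter plus combining mark, and a plain code-unit comparison treats the two as different. `ToyCompose` and `EqualAfterToyCompose` are hypothetical names, and the composition table covers only two pairs; a real normalizer is driven by the full Unicode data and is not shown here.

```cpp
#include <cassert>
#include <string>

// Toy canonical composition (assumption: illustration only).
// Handles just 'e' + U+0301 and 'a' + U+0301; everything else is copied.
std::u16string ToyCompose(const std::u16string& in) {
  std::u16string out;
  for (char16_t c : in) {
    if (!out.empty() && c == 0x0301) {  // COMBINING ACUTE ACCENT
      if (out.back() == u'e') { out.back() = 0x00E9; continue; }  // U+00E9
      if (out.back() == u'a') { out.back() = 0x00E1; continue; }  // U+00E1
    }
    out.push_back(c);
  }
  return out;
}

// Compare two UTF-16 strings after bringing both to the same (toy) form.
bool EqualAfterToyCompose(const std::u16string& a, const std::u16string& b) {
  return ToyCompose(a) == ToyCompose(b);
}
```

With this sketch, `u"caf\u00E9"` and `u"cafe\u0301"` compare unequal as raw code units but equal through `EqualAfterToyCompose`, which is the behavior a search or comparison routine would want.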

Updated

18 years ago
QA Contact: teruko → ftang

Comment 4

18 years ago
We need to make sure that the Unicode data we generate is normalized,
but I think we should be able to assume incoming data is already normalized.

According to the W3 "Character Model for the World Wide Web", all the data
received should be normalized according to the Unicode Normalization Form C
(See: http://www.unicode.org/unicode/reports/tr15/tr15-17.html).

See http://www.w3.org/TR/1999/WD-charmod-19991129/#Normalization
   The producer of text data MUST ensure that data is produced or sent out
   in normalized form. For the purpose of W3C specifications and their
   implementations, the producer of text data is the sender of the data in
   the case of protocols. In the case of formats, it is the tool that
   produces the data. 

Comment 6

18 years ago
Normalization Test Suite:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite

For those of you who don't have password access to poke around until you
find the right file, the correct URLs are:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite.zip
http://www.unicode.org/unicode/reports/tr15/conformance/NormalizerTestSuite.txt
(Reporter)

Comment 7

18 years ago
Marking as Future.
Target Milestone: M20 → Future

Comment 8

18 years ago
See http://www.macchiato.com/unicode/normalization_footprint.htm
Normalization Footprint Description
  This document describes how much memory the different normalization forms
  occupy at a minimum (e.g., with an implementation tuned for minimal
  space consumption).

See also http://www.w3.org/TR/charmod
Character Model for the World Wide Web
   http://www.w3.org/TR/charmod/#sec-Normalization
   Section 4: Early Uniform Normalization

   Note: 
   4.3 Responsibility for Normalization

         Producers MUST produce text data in normalized form. For the purpose
         of W3C specifications and their implementations, the producer of text
         data is the sender of the data in the case of protocols and the tool
         that produces the data in the case of formats.

              Note: Implementers of producer software in the above sense are
              encouraged to delegate normalization to their respective data
              sources wherever possible. Examples of data sources are
              operating systems, libraries, and keyboard drivers.

         The recipients of text data MUST assume the data is normalized and
         MUST NOT normalize it. Recipients which transcode text data from a
         legacy encoding to a Unicode encoding form MUST use a
         normalizing-transcoder

Comment 9

17 years ago
Normalization (checking) may become a requirement for XML 1.1:
<URL: http://www.w3.org/TR/2001/WD-xml11-20011213/#sec2.13 >.

Comment 10

16 years ago
Normalization Form KC is needed for internationalized domain name support.
http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-03.txt

Normalization is included in ICU (http://oss.software.ibm.com/icu/).
It uses a data file of about 100 KB.
Blocks: 112979

Comment 11

16 years ago
Interface proposal.

Open issues:
1) Use byte count or char count for length arguments?
2) Use UTF-16 or UTF-32?
3) Should the caller or the callee allocate the output buffer?
4) Should this belong to uconv or somewhere else?
5) Can we use the ICU implementation?

#define NS_ERROR_UNORM_MOREOUTPUT  \
        NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_UCONV, 0x51)

typedef enum {
  kNFD,         // Canonical Decomposition
  kNFC,         // Canonical Decomposition,
                // followed by Canonical Composition
  kNFKD,        // Compatibility Decomposition
  kNFKC         // Compatibility Decomposition,
                // followed by Canonical Composition
} nsUnicodeNormalizationForm;

/**
 * Normalize Unicode.
 *
 * @param aNormForm       [IN]  Normalization form.
 * @param aSrc            [IN]  Pointer to the input UTF-16 string.
 * @param aSrcLength      [IN]  Length of the input (in 16-bit units).
 * @param aDest           [OUT] Pointer to an output buffer supplied by the
 *                              caller.
 * @param aDestBuffLength [IN]  Length of the caller-supplied buffer
 *                              (in 16-bit units).
 * @param aDestLength     [OUT] Length of the normalized UTF-16 string
 *                              (in 16-bit units).
 * @return                NS_OK on success,
 *                        NS_ERROR_UNORM_MOREOUTPUT if the supplied output
 *                        buffer is not large enough.
 */
nsresult NormalizeUnicode(nsUnicodeNormalizationForm aNormForm,
                          const PRUnichar *aSrc, PRUint32 aSrcLength,
                          PRUnichar *aDest, PRUint32 aDestBuffLength,
                          PRUint32 *aDestLength);
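
Open issue 3 (who allocates the output buffer) is easier to weigh with the calling pattern in view. The following is a minimal sketch of the caller-allocated variant, not the proposed implementation. It uses standard types in place of `PRUnichar`, `PRUint32`, and `nsresult`, omits the form argument, and swaps in a toy routine that only composes 'e' + U+0301. The names `SketchNormalize`, `SketchResult`, and `NormalizeToString` are hypothetical. It also assumes the length out-parameter reports the required size when the buffer is too small, which the proposal above does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for NS_OK and NS_ERROR_UNORM_MOREOUTPUT (assumption).
enum SketchResult { kOk, kMoreOutput };

// Toy normalizer: composes 'e' + U+0301 into U+00E9, copies everything else.
// Never writes past destBuffLength; always stores the required length.
SketchResult SketchNormalize(const char16_t* src, uint32_t srcLength,
                             char16_t* dest, uint32_t destBuffLength,
                             uint32_t* destLength) {
  uint32_t needed = 0;
  for (uint32_t i = 0; i < srcLength; ++i) {
    char16_t c = src[i];
    if (c == u'e' && i + 1 < srcLength && src[i + 1] == 0x0301) {
      c = 0x00E9;  // precomposed form
      ++i;         // consume the combining mark
    }
    if (needed < destBuffLength) dest[needed] = c;
    ++needed;
  }
  *destLength = needed;
  return needed <= destBuffLength ? kOk : kMoreOutput;
}

// Caller-side pattern: try a small buffer, grow once if told to.
std::u16string NormalizeToString(const std::u16string& in) {
  std::vector<char16_t> buf(1);  // deliberately small to exercise the retry
  uint32_t len = 0;
  const uint32_t inLen = static_cast<uint32_t>(in.size());
  SketchResult rv = SketchNormalize(in.data(), inLen, buf.data(),
                                    static_cast<uint32_t>(buf.size()), &len);
  if (rv == kMoreOutput) {
    buf.resize(len);
    rv = SketchNormalize(in.data(), inLen, buf.data(),
                         static_cast<uint32_t>(buf.size()), &len);
  }
  assert(rv == kOk);
  return std::u16string(buf.data(), len);
}
```

The retry in `NormalizeToString` is the cost of caller allocation: every call site needs either a worst-case buffer or a second call. A callee-allocated design would remove that loop at the price of an ownership convention for the returned memory.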

(Reporter)

Updated

16 years ago
Target Milestone: Future → ---
(Reporter)

Comment 12

16 years ago
shanjian, can you implement a normalizer and compose / decompose code in
mozilla? Maybe we can port the ICU code or write our own.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW

Updated

16 years ago
Depends on: 8275
(Assignee)

Updated

16 years ago
Status: NEW → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → DUPLICATE
(Assignee)

Comment 13

16 years ago

*** This bug has been marked as a duplicate of 8275 ***

Updated

16 years ago
No longer blocks: 112979
verified duplicate
Status: RESOLVED → VERIFIED