Closed Bug 12063 Opened 25 years ago Closed 22 years ago

Need Unicode Normalization process...

Categories

(Core :: Internationalization, defect, P3)

VERIFIED DUPLICATE of bug 8275

People

(Reporter: ftang, Assigned: shanjian)

Details

(Keywords: helpwanted)

Details unknown. Needed after Beta 1.
Status: NEW → ASSIGNED
Target Milestone: M15
Post Beta 1; marking M15. This should probably be coded inside unicharutil.
Whiteboard: Help Wanted
Target Milestone: M15 → M20
Changing the target milestone to M20.
Keywords: helpwanted
Whiteboard: Help Wanted
Won't we need this for things like searching and comparison of Unicode data?
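
To illustrate the comparison concern, here is a minimal sketch, not Mozilla code. It assumes a toy composition step (the hypothetical ToyComposeEAcute) that knows only one pair, U+0065 U+0301 -> U+00E9; a real normalizer needs the full Unicode data tables and combining-class handling. It shows that two canonically equivalent strings differ as raw UTF-16 code units and match only after being brought to the same form.

```cpp
#include <cassert>
#include <string>

// Toy canonical composition covering ONLY the pair U+0065 U+0301 -> U+00E9.
// Hypothetical helper for illustration; it ignores combining classes and
// every other composable sequence.
std::u16string ToyComposeEAcute(const std::u16string& in) {
  std::u16string out;
  for (char16_t c : in) {
    if (c == 0x0301 && !out.empty() && out.back() == u'e') {
      out.back() = 0x00E9;  // replace "e" with the precomposed e-acute
    } else {
      out.push_back(c);
    }
  }
  return out;
}

// Returns true when both strings are equal after the toy composition step.
bool ToyEquivalent(const std::u16string& a, const std::u16string& b) {
  return ToyComposeEAcute(a) == ToyComposeEAcute(b);
}
```

A plain code-unit comparison of u"caf\u00E9" and u"cafe\u0301" reports a mismatch, while ToyEquivalent treats them as the same text.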
QA Contact: teruko → ftang
We need to make sure that the Unicode data we generate is normalized,
but I think we should be able to assume incoming data is already normalized.

According to the W3 "Character Model for the World Wide Web", all the data
received should be normalized according to the Unicode Normalization Form C
(See: http://www.unicode.org/unicode/reports/tr15/tr15-17.html).

See http://www.w3.org/TR/1999/WD-charmod-19991129/#Normalization
   The producer of text data MUST ensure that data is produced or sent out
   in normalized form. For the purpose of W3C specifications and their
   implementations, the producer of text data is the sender of the data in
   the case of protocols. In the case of formats, it is the tool that
   produces the data. 
Normalization Test Suite:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite

For those of you who don't have password access to poke around until you
find the right file, the correct URLs are:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite.zip
http://www.unicode.org/unicode/reports/tr15/conformance/NormalizerTestSuite.txt
Marking as Future.
Target Milestone: M20 → Future
See http://www.macchiato.com/unicode/normalization_footprint.htm
Normalization Footprint Description
  This document describes how much memory the different normalization forms
  occupy at a minimum (e.g., with an implementation tuned for minimal
  space consumption).

See also http://www.w3.org/TR/charmod
Character Model for the World Wide Web
   http://www.w3.org/TR/charmod/#sec-Normalization
   Section 4: Early Uniform Normalization

   Note: 
   4.3 Responsibility for Normalization

         Producers MUST produce text data in normalized form. For the purpose
         of W3C specifications and their implementations, the producer of text
         data is the sender of the data in the case of protocols and the tool
         that produces the data in the case of formats.

              Note: Implementers of producer software in the above sense are
              encouraged to delegate normalization to their respective data
              sources wherever possible. Examples of data sources are
              operating systems, libraries, and keyboard drivers.

         The recipients of text data MUST assume the data is normalized and
         MUST NOT normalize it. Recipients which transcode text data from a
         legacy encoding to a Unicode encoding form MUST use a
         normalizing-transcoder
Normalization (checking) may become a requirement for XML 1.1:
<URL: http://www.w3.org/TR/2001/WD-xml11-20011213/#sec2.13 >.
Normalization Form KC is needed for internationalized domain name support.
http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-03.txt

Normalization is included in ICU (http://oss.software.ibm.com/icu/).
It uses about 100 KB of data files.
Blocks: 112979
Interface proposal.

open issues:
1) use byte count or char count for length arguments?
2) use UTF-16 or UTF-32?
3) should caller allocate out buffer or callee?
4) should this belong to uconv or somewhere else?
5) can we use ICU implementation?
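
For open issues 1 and 2, the following sketch shows how the candidate length units differ for the same text. The helper names (CountCodePoints, Utf16ByteLength) are hypothetical and exist only for this illustration. The sample character is U+1D15E, which lies outside the BMP and therefore needs a surrogate pair in UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string by treating each well-formed
// surrogate pair as one character. Unpaired surrogates count as one each.
std::size_t CountCodePoints(const std::u16string& s) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
    bool next_low =
        i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
    if (high && next_low) ++i;  // skip the low half of the pair
    ++n;
  }
  return n;
}

// Byte length of the same string: two bytes per 16-bit code unit.
std::size_t Utf16ByteLength(const std::u16string& s) {
  return s.size() * sizeof(char16_t);
}
```

For u"a\U0001D15E" the three measures disagree: 2 code points, 3 UTF-16 code units, and 6 bytes. Whichever unit the interface picks therefore has to be stated explicitly for every length argument.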

#define NS_ERROR_UNORM_MOREOUTPUT  \
        NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_UCONV, 0x51)

typedef enum {
  kNFD,         // Canonical Decomposition
  kNFC,         // Canonical Decomposition,
                // followed by Canonical Composition
  kNFKD,        // Compatibility Decomposition
  kNFKC         // Compatibility Decomposition,
                // followed by Canonical Composition
} nsUnicodeNormalizationForm;

/**
 * Normalize Unicode.
 *
 * @param aNormForm       [IN]  Normalization form.
 * @param aSrc            [IN]  Pointer to the input UTF-16 string.
 * @param aSrcLength      [IN]  Length of the input (in 16-bit units).
 * @param aDest           [OUT] Pointer to an output buffer supplied by the caller.
 * @param aDestBuffLength [IN]  Length of the caller-supplied buffer (in 16-bit units).
 * @param aDestLength     [OUT] Length of the normalized UTF-16 string (in 16-bit units).
 * @return                NS_OK on success,
 *                        NS_ERROR_UNORM_MOREOUTPUT if the supplied output buffer
 *                        is not large enough.
 */
nsresult NormalizeUnicode(nsUnicodeNormalizationForm aNormForm,
                          const PRUnichar *aSrc, PRUint32 aSrcLength,
                          PRUnichar *aDest, PRUint32 aDestBuffLength,
                          PRUint32 *aDestLength);
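
A rough sketch of how a caller might use an interface of this shape with a caller-allocated buffer (open issue 3). Everything here is hypothetical: ToyNormalize is a stand-in that just copies its input and performs no real normalization, standard types replace PRUnichar and PRUint32, and ToyResult replaces nsresult. The sketch also assumes that, when the buffer is too small, the output length argument reports the required size; the proposal above does not say so.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum ToyResult { kToyOk, kToyMoreOutput };  // stands in for nsresult values

// Stand-in with the proposed buffer contract. It copies the input unchanged.
// ASSUMPTION: on kToyMoreOutput, *dest_len holds the required length.
ToyResult ToyNormalize(const char16_t* src, uint32_t src_len,
                       char16_t* dest, uint32_t dest_cap,
                       uint32_t* dest_len) {
  assert(dest_len != nullptr);
  *dest_len = src_len;
  if (src_len > dest_cap) return kToyMoreOutput;
  for (uint32_t i = 0; i < src_len; ++i) dest[i] = src[i];
  return kToyOk;
}

// Calls once with the caller's initial buffer and, if that is too small,
// grows the buffer to the reported size and retries once.
std::u16string NormalizeWithRetry(const std::u16string& in,
                                  uint32_t initial_cap) {
  std::vector<char16_t> buf(initial_cap);
  uint32_t out_len = 0;
  const uint32_t in_len = static_cast<uint32_t>(in.size());
  ToyResult r = ToyNormalize(in.data(), in_len, buf.data(),
                             static_cast<uint32_t>(buf.size()), &out_len);
  if (r == kToyMoreOutput) {
    buf.resize(out_len);
    r = ToyNormalize(in.data(), in_len, buf.data(),
                     static_cast<uint32_t>(buf.size()), &out_len);
  }
  assert(r == kToyOk);
  return std::u16string(buf.begin(), buf.begin() + out_len);
}
```

With a callee-allocated buffer the retry loop would disappear, at the cost of deciding who frees the result. That is the trade-off behind open issue 3.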

Target Milestone: Future → ---
shanjian, can you implement a normalizer and compose/decompose code in
Mozilla? Maybe we can port the ICU code or write our own.
Assignee: ftang → shanjian
Status: ASSIGNED → NEW
Depends on: 8275
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE

*** This bug has been marked as a duplicate of 8275 ***
No longer blocks: 112979
verified duplicate
Status: RESOLVED → VERIFIED