12063 - Need Unicode Normalization process...

We need to make sure that the Unicode data we generate is normalized, but I think we should be able to assume that incoming data is already normalized. According to the W3C "Character Model for the World Wide Web", all data received should be normalized according to Unicode Normalization Form C (see http://www.unicode.org/unicode/reports/tr15/tr15-17.html).

See http://www.w3.org/TR/1999/WD-charmod-19991129/#Normalization:

The producer of text data MUST ensure that data is produced or sent out in normalized form. For the purpose of W3C specifications and their implementations, the producer of text data is the sender of the data in the case of protocols. In the case of formats, it is the tool that produces the data.

bobj

Comment 5

•

25 years ago

Normalization charts:
http://www.unicode.org/unicode/reports/tr15/NormalizerChart.html
http://www.unicode.org/unicode/reports/tr15/instructions.html

bobj

Comment 6

•

25 years ago

Normalization Test Suite: http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite

For those of you who don't have password access to poke around until you find the right file, the correct URLs are:
http://www.unicode.org/unicode/reports/tr15/conformance/DraftTestSuite.zip
http://www.unicode.org/unicode/reports/tr15/conformance/NormalizerTestSuite.txt

Frank Tang

Reporter

Comment 7

•

25 years ago

mark as future

Target Milestone: M20 → Future

bobj

Comment 8

•

25 years ago

See http://www.macchiato.com/unicode/normalization_footprint.htm
Normalization Footprint

Description: This document describes how much memory the different normalization forms occupy at a minimum (e.g., with an implementation tuned for minimal space consumption).

See also http://www.w3.org/TR/charmod
Character Model for the World Wide Web
http://www.w3.org/TR/charmod/#sec-Normalization
Section 4: Early Uniform Normalization

Note: 4.3 Responsibility for Normalization

Producers MUST produce text data in normalized form. For the purpose of W3C specifications and their implementations, the producer of text data is the sender of the data in the case of protocols and the tool that produces the data in the case of formats.

Note: Implementers of producer software in the above sense are encouraged to delegate normalization to their respective data sources wherever possible. Examples of data sources are operating systems, libraries, and keyboard drivers.

The recipients of text data MUST assume the data is normalized and MUST NOT normalize it. Recipients which transcode text data from a legacy encoding to a Unicode encoding form MUST use a normalizing-transcoder.
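To illustrate why the quoted text puts the burden on the producer: a recipient that does not normalize can only compare text by code units. The sketch below is an illustration, not Mozilla code; `SameText`, `kComposed`, `kDecomposed`, and `CheckSpellingsDiffer` are hypothetical names. It shows that the precomposed and decomposed spellings of "é" do not match under such a comparison.

```cpp
#include <cassert>
#include <string>

// Recipient-side comparison: plain code-unit equality, with no
// normalization step, as the recipient rule quoted above requires.
bool SameText(const std::u16string& a, const std::u16string& b) {
  return a == b;
}

// Two spellings of the same text "é":
// precomposed U+00E9 (the NFC form) and "e" + U+0301 (the NFD form).
const std::u16string kComposed = u"\u00E9";
const std::u16string kDecomposed = u"e\u0301";

// If one producer sends the decomposed spelling and another the
// composed one, the recipient treats them as different strings.
void CheckSpellingsDiffer() {
  assert(!SameText(kComposed, kDecomposed));
  assert(SameText(kComposed, kComposed));
}
```

If every producer emits Form C, only the composed spelling reaches the recipient, and the plain comparison is sufficient.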

Karl Ove Hufthammer

Comment 9

•

24 years ago

Normalization (checking) may become a requirement for XML 1.1: <URL: http://www.w3.org/TR/2001/WD-xml11-20011213/#sec2.13 >.

nhottanscp

Comment 10

•

24 years ago

Normalization form KC is needed for international domain name support.
http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-03.txt

Normalization is included in ICU (http://oss.software.ibm.com/icu/). It uses about 100 KB of data file.
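As a rough illustration of what the compatibility part of form KC buys for domain names: the fullwidth forms U+FF01..U+FF5E have compatibility decompositions to ASCII U+0021..U+007E, so a fullwidth spelling of a label maps to the ASCII one. The sketch below handles only that one range and is not a real NFKC implementation; `FoldFullwidth`, `ToyCompatFold`, and `CheckLabelsMatch` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Toy subset of the compatibility mappings: fullwidth forms
// U+FF01..U+FF5E map to ASCII U+0021..U+007E. Everything else is
// passed through unchanged (a real NFKC needs the full Unicode data).
char16_t FoldFullwidth(char16_t c) {
  if (c >= 0xFF01 && c <= 0xFF5E) {
    return static_cast<char16_t>(c - 0xFF01 + 0x0021);
  }
  return c;
}

// Apply the toy mapping to every code unit of the input.
std::u16string ToyCompatFold(const std::u16string& in) {
  std::u16string out;
  out.reserve(in.size());
  for (char16_t c : in) {
    out.push_back(FoldFullwidth(c));
  }
  return out;
}

// A fullwidth spelling of "example" folds to the ASCII spelling.
void CheckLabelsMatch() {
  const std::u16string wide =
      u"\uFF45\uFF58\uFF41\uFF4D\uFF50\uFF4C\uFF45";
  assert(ToyCompatFold(wide) == u"example");
}
```

A production implementation would take the mappings from the Unicode data files (or from ICU, as noted above) rather than hard-coding a range.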

Blocks: 112979

nhottanscp

Comment 11

•

24 years ago

Interface proposal.

Open issues:
1) use byte count or char count for length arguments?
2) use UTF-16 or UTF-32?
3) should caller allocate out buffer or callee?
4) should this belong to uconv or somewhere else?
5) can we use ICU implementation?

#define NS_ERROR_UNORM_MOREOUTPUT \
  NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_UCONV, 0x51)

typedef enum {
  kNFD,   // Canonical Decomposition
  kNFC,   // Canonical Decomposition,
          // followed by Canonical Composition
  kNFKD,  // Compatibility Decomposition
  kNFKC   // Compatibility Decomposition,
          // followed by Canonical Composition
} nsUnicodeNorilizationForm;

/**
 * Normalize Unicode.
 *
 * @param aNormForm       [IN]  Normalization form.
 * @param aSrc            [IN]  A pointer to an input UTF-16 string.
 * @param aSrcLength      [IN]  The length of the input (in 16-bit units).
 * @param aDest           [OUT] A pointer to an output buffer supplied by the caller.
 * @param aDestBuffLength [IN]  The length of the caller-supplied buffer (in 16-bit units).
 * @param aDestLength     [OUT] The length of the normalized UTF-16 string (in 16-bit units).
 * @return NS_OK for success,
 *         NS_ERROR_UNORM_MOREOUTPUT if the supplied out buffer is not large enough.
 */
nsresult NormilizeUnicode(nsUnicodeNorilizationForm aNormForm,
                          const PRUnichar *aSrc,
                          PRUint32 aSrcLength,
                          PRUnichar *aDest,
                          PRUint32 aDestBuffLength,
                          PRUint32 *aDestLength);
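A minimal sketch of how a caller might use the caller-allocated buffer contract from the proposal (open issue 3). Everything here is an assumption for illustration: the typedefs are stand-ins for the Mozilla types, the error value is a placeholder rather than the real macro expansion, `ToyNormalize` and `NormalizeWithRetry` are hypothetical, and the toy only composes a, e, or o followed by U+0301. It also assumes that `aDestLength` is set to the required length when the buffer is too small, which the proposal does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for the types and codes named in the proposal.
typedef char16_t PRUnichar;
typedef uint32_t PRUint32;
typedef uint32_t nsresult;
const nsresult NS_OK = 0;
const nsresult NS_ERROR_UNORM_MOREOUTPUT = 1;  // placeholder value

// Toy composition table: base letter + U+0301 (combining acute).
static PRUnichar ComposeAcute(PRUnichar base) {
  switch (base) {
    case u'a': return 0x00E1;
    case u'e': return 0x00E9;
    case u'o': return 0x00F3;
    default:   return 0;  // no composition known to this toy
  }
}

// Hypothetical toy with the same buffer contract as the proposal.
// Assumption: *aDestLength receives the required length even on failure.
nsresult ToyNormalize(const PRUnichar* aSrc, PRUint32 aSrcLength,
                      PRUnichar* aDest, PRUint32 aDestBuffLength,
                      PRUint32* aDestLength) {
  assert(aDestLength != nullptr);
  PRUint32 needed = 0;
  for (PRUint32 i = 0; i < aSrcLength; ++i) {
    PRUnichar out = aSrc[i];
    if (i + 1 < aSrcLength && aSrc[i + 1] == 0x0301) {
      PRUnichar composed = ComposeAcute(aSrc[i]);
      if (composed != 0) {
        out = composed;
        ++i;  // consume the combining mark
      }
    }
    if (aDest != nullptr && needed < aDestBuffLength) {
      aDest[needed] = out;
    }
    ++needed;
  }
  *aDestLength = needed;
  return needed <= aDestBuffLength ? NS_OK : NS_ERROR_UNORM_MOREOUTPUT;
}

// Caller side: try a small buffer, grow to the reported length, retry.
std::vector<PRUnichar> NormalizeWithRetry(const std::vector<PRUnichar>& src) {
  std::vector<PRUnichar> dest(1);
  PRUint32 len = 0;
  nsresult rv = ToyNormalize(src.data(), static_cast<PRUint32>(src.size()),
                             dest.data(), static_cast<PRUint32>(dest.size()),
                             &len);
  if (rv == NS_ERROR_UNORM_MOREOUTPUT) {
    dest.resize(len);
    rv = ToyNormalize(src.data(), static_cast<PRUint32>(src.size()),
                      dest.data(), static_cast<PRUint32>(dest.size()), &len);
  }
  assert(rv == NS_OK);
  dest.resize(len);
  return dest;
}
```

If the callee allocated the buffer instead, the retry would disappear, at the cost of defining who frees the result.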

Frank Tang

Reporter

Updated

•

23 years ago

Target Milestone: Future → ---

Frank Tang

Reporter

Comment 12

•

23 years ago

shanjian, can you implement a normalizer and compose / decompose code in mozilla? Maybe we can port the ICU code or write our own.

Assignee: ftang → shanjian

Status: ASSIGNED → NEW

nhottanscp

Updated

•

23 years ago

Depends on: 8275

Shanjian Li

Assignee

Updated

•

23 years ago

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → DUPLICATE

Shanjian Li

Assignee

Comment 13

•

23 years ago

*** This bug has been marked as a duplicate of 8275 ***

nhottanscp

Updated

•

23 years ago

No longer blocks: 112979

Simon Paquet [:sipaq]

Comment 14

•

22 years ago

verified duplicate

Status: RESOLVED → VERIFIED

Categories

(Core :: Internationalization, defect, P3)

People

(Reporter: ftang, Assigned: shanjian)

Details

(Keywords: helpwanted)

Security

(public)
