Closed Bug 555940 Opened 15 years ago Closed 15 years ago

[HTML5] Named character names should use Java String and a dedicated type in C++

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Attachments

(1 file)

Make named character names use a dedicated type 15 years ago Henri Sivonen (:hsivonen) 17.96 KB, patch	taras.mozilla : review+ jst : approval2.0+	Details \| Diff \| Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Description

•

15 years ago

Currently, the named character name table in the HTML5 parser uses byte[] in Java and static byte array-backed heap-allocated jArray objects in C++. It would be desirable to use java.lang.String objects in Java (to avoid the added footprint of the byte arrays in Java and GWT-compiled JS) and to use a dedicated struct type that could respond to charAt() and length() in C++ (to avoid the heap-allocated jArray objects).

Henri Sivonen (:hsivonen)

Assignee

Comment 1

•

15 years ago

In C++, the length should probably be baked into the character buffer so that the named character struct looks roughly like this: struct nsHtml5CharacterName { short mOffset; PRInt32 length() { return static_cast<PRInt32> (nsHtml5NamedCharacters::ALL_NAMES[mOffset]); } PRUnichar charAt(PRInt32 index) { return static_cast<PRUnichar> (nsHtml5NamedCharacters::ALL_NAMES[mOffset + 1 + index]); } }

Henri Sivonen (:hsivonen)

Assignee

Comment 2

•

15 years ago

Attached patch Make named character names use a dedicated type — Details — Splinter Review

Chances are the patch won't apply to the trunk. You can get my entire queue from http://hg.mozilla.org/users/hsivonen_iki.fi/patches/file/80f46a1077ac

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Attachment #481184 - Flags: review?(tglek)

(dormant account)

Comment 3

•

15 years ago

Comment on attachment 481184 [details] [diff] [review] Make named character names use a dedicated type ># HG changeset patch ># User Henri Sivonen <hsivonen@iki.fi> ># Date 1285659151 -10800 ># Node ID f3ff660a74c470eb5271b495f84e695f7683893a ># Parent c987796ba739eca3bb573d9621d451a49ed8cf50 >Bug 555940 - Make named character names use a dedicated data type. r=NOT_REVIEWED, a=NOT_APPROVED. > >diff --git a/parser/html/javasrc/Tokenizer.java b/parser/html/javasrc/Tokenizer.java >--- a/parser/html/javasrc/Tokenizer.java >+++ b/parser/html/javasrc/Tokenizer.java >@@ -31,16 +31,17 @@ > * "Â© Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and > * Opera Software ASA. You are granted a license to use, reproduce and > * create derivative works of this document." > */ > > package nu.validator.htmlparser.impl; > > import nu.validator.htmlparser.annotation.Auto; >+import nu.validator.htmlparser.annotation.CharacterName; > import nu.validator.htmlparser.annotation.Const; > import nu.validator.htmlparser.annotation.Inline; > import nu.validator.htmlparser.annotation.Local; > import nu.validator.htmlparser.annotation.NoLength; > import nu.validator.htmlparser.common.EncodingDeclarationHandler; > import nu.validator.htmlparser.common.Interner; > import nu.validator.htmlparser.common.TokenHandler; > import nu.validator.htmlparser.common.XmlViolationPolicy; >@@ -3105,39 +3106,39 @@ public class Tokenizer implements Locato > * identifiers in the first column of the named > * character references table (in a case-sensitive > * manner). > */ > loloop: for (;;) { > if (hi < lo) { > break outer; > } >- if (entCol == NamedCharacters.NAMES[lo].length) { >+ if (entCol == NamedCharacters.NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > NamedCharacters.NAMES[lo].length) { >+ } else if (entCol > NamedCharacters.NAMES[lo].length()) { > break outer; >- } else if (c > NamedCharacters.NAMES[lo][entCol]) { >+ } else if (c > NamedCharacters.NAMES[lo].charAt(entCol)) { > lo++; > } else { > break loloop; > } > } > > hiloop: for (;;) { > if (hi < lo) { > break outer; > } >- if (entCol == NamedCharacters.NAMES[hi].length) { >+ if (entCol == NamedCharacters.NAMES[hi].length()) { > break hiloop; > } >- if (entCol > NamedCharacters.NAMES[hi].length) { >+ if (entCol > NamedCharacters.NAMES[hi].length()) { > break outer; >- } else if (c < NamedCharacters.NAMES[hi][entCol]) { >+ } else if (c < NamedCharacters.NAMES[hi].charAt(entCol)) { > hi--; > } else { > break hiloop; > } > } > > if (hi < lo) { > break outer; >@@ -3157,19 +3158,19 @@ public class Tokenizer implements Locato > if ((returnState & DATA_AND_RCDATA_MASK) == 0) { > cstart = pos; > } > state = transition(state, returnState, reconsume, pos); > reconsume = true; > continue stateloop; > } else { > // c can't be CR, LF or nul if we got here >- byte[] candidateArr = NamedCharacters.NAMES[candidate]; >- if (candidateArr.length == 0 >- || candidateArr[candidateArr.length - 1] != ';') { >+ @Const @CharacterName String candidateName = NamedCharacters.NAMES[candidate]; >+ if (candidateName.length() == 0 >+ || candidateName.charAt(candidateName.length() - 1) != ';') { > /* > * If the last character matched is not a U+003B > * SEMICOLON (;), there is a parse error. > */ > if ((returnState & DATA_AND_RCDATA_MASK) != 0) { > /* > * If the entity is being consumed as part of an > * attribute, and the last character matched is >@@ -6320,63 +6321,62 @@ public class Tokenizer implements Locato > * identifiers in the first column of the named > * character references table (in a case-sensitive > * manner). > */ > hiloop: for (;;) { > if (hi == -1) { > break hiloop; > } >- if (entCol == NamedCharacters.NAMES[hi].length) { >+ if (entCol == NamedCharacters.NAMES[hi].length()) { > break hiloop; > } >- if (entCol > NamedCharacters.NAMES[hi].length) { >+ if (entCol > NamedCharacters.NAMES[hi].length()) { > break outer; >- } else if (c < NamedCharacters.NAMES[hi][entCol]) { >+ } else if (c < NamedCharacters.NAMES[hi].charAt(entCol)) { > hi--; > } else { > break hiloop; > } > } > > loloop: for (;;) { > if (hi < lo) { > break outer; > } >- if (entCol == NamedCharacters.NAMES[lo].length) { >+ if (entCol == NamedCharacters.NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > NamedCharacters.NAMES[lo].length) { >+ } else if (entCol > NamedCharacters.NAMES[lo].length()) { > break outer; >- } else if (c > NamedCharacters.NAMES[lo][entCol]) { >+ } else if (c > NamedCharacters.NAMES[lo].charAt(entCol)) { > lo++; > } else { > break loloop; > } > } > if (hi < lo) { > break outer; > } > continue; > } > >- // TODO warn about apos (IE) and TRADE (Opera) > if (candidate == -1) { > /* > * If no match can be made, then this is a parse error. > */ > errNoNamedCharacterMatch(); > emitOrAppendStrBuf(returnState); > state = returnState; > continue eofloop; > } else { >- byte[] candidateArr = NamedCharacters.NAMES[candidate]; >- if (candidateArr.length == 0 >- || candidateArr[candidateArr.length - 1] != ';') { >+ @Const @CharacterName String candidateName = NamedCharacters.NAMES[candidate]; >+ if (candidateName.length() == 0 >+ || candidateName.charAt(candidateName.length() - 1) != ';') { > /* > * If the last character matched is not a U+003B > * SEMICOLON (;), there is a parse error. > */ > if ((returnState & DATA_AND_RCDATA_MASK) != 0) { > /* > * If the entity is being consumed as part of an > * attribute, and the last character matched is >diff --git a/parser/html/nsHtml5NamedCharacters.cpp b/parser/html/nsHtml5NamedCharacters.cpp >--- a/parser/html/nsHtml5NamedCharacters.cpp >+++ b/parser/html/nsHtml5NamedCharacters.cpp >@@ -25,18 +25,16 @@ > #include "jArray.h" > #include "nscore.h" > #include "nsDebug.h" > #include "prlog.h" > #include "nsMemory.h" > > #include "nsHtml5NamedCharacters.h" > >-jArray<jArray<PRInt8,PRInt32>,PRInt32> nsHtml5NamedCharacters::NAMES; >- > const PRUnichar nsHtml5NamedCharacters::VALUES[][2] = { > #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \ > { VALUE }, > #include "nsHtml5NamedCharactersInclude.h" > #undef NAMED_CHARACTER_REFERENCE > {0, 0} }; > > PRUnichar** nsHtml5NamedCharacters::WINDOWS_1252; >@@ -97,57 +95,49 @@ enum NamePositions { > NAME_##N##_DUMMY, /* automatically one higher than previous */ \ > NAME_##N##_START = NAME_##N##_DUMMY - 1, \ > NAME_##N##_END = NAME_##N##_START + LEN + FLAG, > #include "nsHtml5NamedCharactersInclude.h" > #undef NAMED_CHARACTER_REFERENCE > DUMMY_FINAL_NAME_VALUE > }; > >-#define NAMED_CHARACTERS_COUNT 2138 >- > /* check that the start positions will fit in 16 bits */ > PR_STATIC_ASSERT(NS_ARRAY_LENGTH(ALL_NAMES) < 0x10000); > >-struct NamedCharacterData { >- PRUint16 nameStart; >- PRUint16 nameLen; >-#ifdef DEBUG >- PRInt32 n; >-#endif >-}; >- >-static const NamedCharacterData charData[NAMED_CHARACTERS_COUNT] = { >+const nsHtml5CharacterName nsHtml5NamedCharacters::NAMES[] = { > #ifdef DEBUG > #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \ > { NAME_##N##_START, LEN, N }, > #else > #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \ > { NAME_##N##_START, LEN, }, > #endif > #include "nsHtml5NamedCharactersInclude.h" > #undef NAMED_CHARACTER_REFERENCE > }; > >+PRInt32 >+nsHtml5CharacterName::length() const >+{ >+ return nameLen; >+} >+ >+PRUnichar >+nsHtml5CharacterName::charAt(PRInt32 index) const >+{ >+ return static_cast<PRUnichar> (ALL_NAMES[nameStart + index]); >+} >+ > void > nsHtml5NamedCharacters::initializeStatics() > { >- NAMES = jArray<jArray<PRInt8,PRInt32>,PRInt32>(NAMED_CHARACTERS_COUNT); >- PRInt8* allNames = const_cast<PRInt8*>(ALL_NAMES); >- for (PRInt32 i = 0; i < NAMED_CHARACTERS_COUNT; ++i) { >- const NamedCharacterData &data = charData[i]; >- NS_ABORT_IF_FALSE(data.n == i, >- "index error in nsHtml5NamedCharactersInclude.h"); >- NAMES[i] = jArray<PRInt8,PRInt32>(allNames + data.nameStart, data.nameLen); >- } >- > WINDOWS_1252 = new PRUnichar*[32]; > for (PRInt32 i = 0; i < 32; ++i) { > WINDOWS_1252[i] = (PRUnichar*)&(WINDOWS_1252_DATA[i]); > } > } > > void > nsHtml5NamedCharacters::releaseStatics() > { >- NAMES.release(); > delete[] WINDOWS_1252; > } >diff --git a/parser/html/nsHtml5NamedCharacters.h b/parser/html/nsHtml5NamedCharacters.h >--- a/parser/html/nsHtml5NamedCharacters.h >+++ b/parser/html/nsHtml5NamedCharacters.h >@@ -25,19 +25,29 @@ > > #include "prtypes.h" > #include "jArray.h" > #include "nscore.h" > #include "nsDebug.h" > #include "prlog.h" > #include "nsMemory.h" > >+struct nsHtml5CharacterName { >+ PRUint16 nameStart; >+ PRUint16 nameLen; >+ #ifdef DEBUG >+ PRInt32 n; >+ #endif >+ PRInt32 length() const; >+ PRUnichar charAt(PRInt32 index) const; >+}; >+ > class nsHtml5NamedCharacters > { > public: >- static jArray<jArray<PRInt8,PRInt32>,PRInt32> NAMES; >+ static const nsHtml5CharacterName NAMES[]; > static const PRUnichar VALUES[][2]; > static PRUnichar** WINDOWS_1252; > static void initializeStatics(); > static void releaseStatics(); > }; > > #endif // nsHtml5NamedCharacters_h_ >diff --git a/parser/html/nsHtml5Tokenizer.cpp b/parser/html/nsHtml5Tokenizer.cpp >--- a/parser/html/nsHtml5Tokenizer.cpp >+++ b/parser/html/nsHtml5Tokenizer.cpp >@@ -1505,39 +1505,39 @@ nsHtml5Tokenizer::stateLoop(PRInt32 stat > if (c == '\0') { > NS_HTML5_BREAK(stateloop); > } > entCol++; > for (; ; ) { > if (hi < lo) { > NS_HTML5_BREAK(outer); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[lo].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length) { >+ } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length()) { > NS_HTML5_BREAK(outer); >- } else if (c > nsHtml5NamedCharacters::NAMES[lo][entCol]) { >+ } else if (c > nsHtml5NamedCharacters::NAMES[lo].charAt(entCol)) { > lo++; > } else { > NS_HTML5_BREAK(loloop); > } > } > loloop_end: ; > for (; ; ) { > if (hi < lo) { > NS_HTML5_BREAK(outer); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(hiloop); > } >- if (entCol > nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol > nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(outer); >- } else if (c < nsHtml5NamedCharacters::NAMES[hi][entCol]) { >+ } else if (c < nsHtml5NamedCharacters::NAMES[hi].charAt(entCol)) { > hi--; > } else { > NS_HTML5_BREAK(hiloop); > } > } > hiloop_end: ; > if (hi < lo) { > NS_HTML5_BREAK(outer); >@@ -1551,18 +1551,18 @@ nsHtml5Tokenizer::stateLoop(PRInt32 stat > emitOrAppendStrBuf(returnState); > if (!(returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) { > cstart = pos; > } > state = returnState; > reconsume = PR_TRUE; > NS_HTML5_CONTINUE(stateloop); > } else { >- jArray<PRInt8,PRInt32> candidateArr = nsHtml5NamedCharacters::NAMES[candidate]; >- if (!candidateArr.length || candidateArr[candidateArr.length - 1] != ';') { >+ const nsHtml5CharacterName& candidateName = nsHtml5NamedCharacters::NAMES[candidate]; >+ if (!candidateName.length() || candidateName.charAt(candidateName.length() - 1) != ';') { > if ((returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) { > PRUnichar ch; > if (strBufMark == strBufLen) { > ch = c; > } else { > ch = strBuf[strBufMark]; > } > if (ch == '=' || (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) { >@@ -3595,39 +3595,39 @@ nsHtml5Tokenizer::eof() > case NS_HTML5TOKENIZER_CHARACTER_REFERENCE_TAIL: { > for (; ; ) { > PRUnichar c = '\0'; > entCol++; > for (; ; ) { > if (hi == -1) { > NS_HTML5_BREAK(hiloop); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(hiloop); > } >- if (entCol > nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol > nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(outer); >- } else if (c < nsHtml5NamedCharacters::NAMES[hi][entCol]) { >+ } else if (c < nsHtml5NamedCharacters::NAMES[hi].charAt(entCol)) { > hi--; > } else { > NS_HTML5_BREAK(hiloop); > } > } > hiloop_end: ; > for (; ; ) { > if (hi < lo) { > NS_HTML5_BREAK(outer); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[lo].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length) { >+ } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length()) { > NS_HTML5_BREAK(outer); >- } else if (c > nsHtml5NamedCharacters::NAMES[lo][entCol]) { >+ } else if (c > nsHtml5NamedCharacters::NAMES[lo].charAt(entCol)) { > lo++; > } else { > NS_HTML5_BREAK(loloop); > } > } > loloop_end: ; > if (hi < lo) { > NS_HTML5_BREAK(outer); >@@ -3636,18 +3636,18 @@ nsHtml5Tokenizer::eof() > } > outer_end: ; > if (candidate == -1) { > > emitOrAppendStrBuf(returnState); > state = returnState; > NS_HTML5_CONTINUE(eofloop); > } else { >- jArray<PRInt8,PRInt32> candidateArr = nsHtml5NamedCharacters::NAMES[candidate]; >- if (!candidateArr.length || candidateArr[candidateArr.length - 1] != ';') { >+ const nsHtml5CharacterName& candidateName = nsHtml5NamedCharacters::NAMES[candidate]; >+ if (!candidateName.length() || candidateName.charAt(candidateName.length() - 1) != ';') { > if ((returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) { > PRUnichar ch; > if (strBufMark == strBufLen) { > ch = '\0'; > } else { > ch = strBuf[strBufMark]; > } > if ((ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) {

Attachment #481184 - Flags: review?(tglek) → review+

Henri Sivonen (:hsivonen)

Assignee

Updated

•

15 years ago

Blocks: 502176

Henri Sivonen (:hsivonen)

Assignee

Comment 4

•

15 years ago

Comment on attachment 481184 [details] [diff] [review] Make named character names use a dedicated type Requesting approval, because this patch together with the patch for bug 502176 would advance the goal of bug 569629, i.e. getting rid of static initializers on the startup path.

Attachment #481184 - Flags: approval2.0?

Johnny Stenback (:jst)

Updated

•

15 years ago

Attachment #481184 - Flags: approval2.0? → approval2.0+

Henri Sivonen (:hsivonen)

Assignee

Comment 5

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/9623c2032948

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Henri Sivonen (:hsivonen)

Assignee

Comment 6

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/9623c2032948

Status: ASSIGNED → RESOLVED

Closed: 15 years ago → 15 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

[HTML5] Named character names should use Java String and a dedicated type in C++

Categories

(Core :: DOM: HTML Parser, defect, P4)

Tracking

()

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Updated

Comment 5

Comment 6

Attachment

General

Description

File Name

Content Type