Closed
Bug 555940
Opened 14 years ago
Closed 14 years ago
[HTML5] Named character names should use Java String and a dedicated type in C++
Categories
(Core :: DOM: HTML Parser, defect, P4)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
FIXED
People
(Reporter: hsivonen, Assigned: hsivonen)
References
Details
Attachments
(1 file)
17.96 KB,
patch
|
taras.mozilla
:
review+
jst
:
approval2.0+
|
Details | Diff | Splinter Review |
Currently, the named character name table in the HTML5 parser uses byte[] in Java and static byte array-backed heap-allocated jArray objects in C++. It would be desirable to use java.lang.String objects in Java (to avoid the added footprint of the byte arrays in Java and GWT-compiled JS) and to use a dedicated struct type that could respond to charAt() and length() in C++ (to avoid the heap-allocated jArray objects).
Assignee | ||
Comment 1•14 years ago
|
||
In C++, the length should probably be baked into the character buffer so that the named character struct looks roughly like this: struct nsHtml5CharacterName { short mOffset; PRInt32 length() { return static_cast<PRInt32> (nsHtml5NamedCharacters::ALL_NAMES[mOffset]); } PRUnichar charAt(PRInt32 index) { return static_cast<PRUnichar> (nsHtml5NamedCharacters::ALL_NAMES[mOffset + 1 + index]); } }
Assignee | ||
Comment 2•14 years ago
|
||
Chances are the patch won't apply to the trunk. You can get my entire queue from http://hg.mozilla.org/users/hsivonen_iki.fi/patches/file/80f46a1077ac
Comment 3•14 years ago
|
||
Comment on attachment 481184 [details] [diff] [review] Make named character names use a dedicated type ># HG changeset patch ># User Henri Sivonen <hsivonen@iki.fi> ># Date 1285659151 -10800 ># Node ID f3ff660a74c470eb5271b495f84e695f7683893a ># Parent c987796ba739eca3bb573d9621d451a49ed8cf50 >Bug 555940 - Make named character names use a dedicated data type. r=NOT_REVIEWED, a=NOT_APPROVED. > >diff --git a/parser/html/javasrc/Tokenizer.java b/parser/html/javasrc/Tokenizer.java >--- a/parser/html/javasrc/Tokenizer.java >+++ b/parser/html/javasrc/Tokenizer.java >@@ -31,16 +31,17 @@ > * "© Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and > * Opera Software ASA. You are granted a license to use, reproduce and > * create derivative works of this document." > */ > > package nu.validator.htmlparser.impl; > > import nu.validator.htmlparser.annotation.Auto; >+import nu.validator.htmlparser.annotation.CharacterName; > import nu.validator.htmlparser.annotation.Const; > import nu.validator.htmlparser.annotation.Inline; > import nu.validator.htmlparser.annotation.Local; > import nu.validator.htmlparser.annotation.NoLength; > import nu.validator.htmlparser.common.EncodingDeclarationHandler; > import nu.validator.htmlparser.common.Interner; > import nu.validator.htmlparser.common.TokenHandler; > import nu.validator.htmlparser.common.XmlViolationPolicy; >@@ -3105,39 +3106,39 @@ public class Tokenizer implements Locato > * identifiers in the first column of the named > * character references table (in a case-sensitive > * manner). > */ > loloop: for (;;) { > if (hi < lo) { > break outer; > } >- if (entCol == NamedCharacters.NAMES[lo].length) { >+ if (entCol == NamedCharacters.NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > NamedCharacters.NAMES[lo].length) { >+ } else if (entCol > NamedCharacters.NAMES[lo].length()) { > break outer; >- } else if (c > NamedCharacters.NAMES[lo][entCol]) { >+ } else if (c > NamedCharacters.NAMES[lo].charAt(entCol)) { > lo++; > } else { > break loloop; > } > } > > hiloop: for (;;) { > if (hi < lo) { > break outer; > } >- if (entCol == NamedCharacters.NAMES[hi].length) { >+ if (entCol == NamedCharacters.NAMES[hi].length()) { > break hiloop; > } >- if (entCol > NamedCharacters.NAMES[hi].length) { >+ if (entCol > NamedCharacters.NAMES[hi].length()) { > break outer; >- } else if (c < NamedCharacters.NAMES[hi][entCol]) { >+ } else if (c < NamedCharacters.NAMES[hi].charAt(entCol)) { > hi--; > } else { > break hiloop; > } > } > > if (hi < lo) { > break outer; >@@ -3157,19 +3158,19 @@ public class Tokenizer implements Locato > if ((returnState & DATA_AND_RCDATA_MASK) == 0) { > cstart = pos; > } > state = transition(state, returnState, reconsume, pos); > reconsume = true; > continue stateloop; > } else { > // c can't be CR, LF or nul if we got here >- byte[] candidateArr = NamedCharacters.NAMES[candidate]; >- if (candidateArr.length == 0 >- || candidateArr[candidateArr.length - 1] != ';') { >+ @Const @CharacterName String candidateName = NamedCharacters.NAMES[candidate]; >+ if (candidateName.length() == 0 >+ || candidateName.charAt(candidateName.length() - 1) != ';') { > /* > * If the last character matched is not a U+003B > * SEMICOLON (;), there is a parse error. > */ > if ((returnState & DATA_AND_RCDATA_MASK) != 0) { > /* > * If the entity is being consumed as part of an > * attribute, and the last character matched is >@@ -6320,63 +6321,62 @@ public class Tokenizer implements Locato > * identifiers in the first column of the named > * character references table (in a case-sensitive > * manner). > */ > hiloop: for (;;) { > if (hi == -1) { > break hiloop; > } >- if (entCol == NamedCharacters.NAMES[hi].length) { >+ if (entCol == NamedCharacters.NAMES[hi].length()) { > break hiloop; > } >- if (entCol > NamedCharacters.NAMES[hi].length) { >+ if (entCol > NamedCharacters.NAMES[hi].length()) { > break outer; >- } else if (c < NamedCharacters.NAMES[hi][entCol]) { >+ } else if (c < NamedCharacters.NAMES[hi].charAt(entCol)) { > hi--; > } else { > break hiloop; > } > } > > loloop: for (;;) { > if (hi < lo) { > break outer; > } >- if (entCol == NamedCharacters.NAMES[lo].length) { >+ if (entCol == NamedCharacters.NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > NamedCharacters.NAMES[lo].length) { >+ } else if (entCol > NamedCharacters.NAMES[lo].length()) { > break outer; >- } else if (c > NamedCharacters.NAMES[lo][entCol]) { >+ } else if (c > NamedCharacters.NAMES[lo].charAt(entCol)) { > lo++; > } else { > break loloop; > } > } > if (hi < lo) { > break outer; > } > continue; > } > >- // TODO warn about apos (IE) and TRADE (Opera) > if (candidate == -1) { > /* > * If no match can be made, then this is a parse error. > */ > errNoNamedCharacterMatch(); > emitOrAppendStrBuf(returnState); > state = returnState; > continue eofloop; > } else { >- byte[] candidateArr = NamedCharacters.NAMES[candidate]; >- if (candidateArr.length == 0 >- || candidateArr[candidateArr.length - 1] != ';') { >+ @Const @CharacterName String candidateName = NamedCharacters.NAMES[candidate]; >+ if (candidateName.length() == 0 >+ || candidateName.charAt(candidateName.length() - 1) != ';') { > /* > * If the last character matched is not a U+003B > * SEMICOLON (;), there is a parse error. > */ > if ((returnState & DATA_AND_RCDATA_MASK) != 0) { > /* > * If the entity is being consumed as part of an > * attribute, and the last character matched is >diff --git a/parser/html/nsHtml5NamedCharacters.cpp b/parser/html/nsHtml5NamedCharacters.cpp >--- a/parser/html/nsHtml5NamedCharacters.cpp >+++ b/parser/html/nsHtml5NamedCharacters.cpp >@@ -25,18 +25,16 @@ > #include "jArray.h" > #include "nscore.h" > #include "nsDebug.h" > #include "prlog.h" > #include "nsMemory.h" > > #include "nsHtml5NamedCharacters.h" > >-jArray<jArray<PRInt8,PRInt32>,PRInt32> nsHtml5NamedCharacters::NAMES; >- > const PRUnichar nsHtml5NamedCharacters::VALUES[][2] = { > #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \ > { VALUE }, > #include "nsHtml5NamedCharactersInclude.h" > #undef NAMED_CHARACTER_REFERENCE > {0, 0} }; > > PRUnichar** nsHtml5NamedCharacters::WINDOWS_1252; >@@ -97,57 +95,49 @@ enum NamePositions { > NAME_##N##_DUMMY, /* automatically one higher than previous */ \ > NAME_##N##_START = NAME_##N##_DUMMY - 1, \ > NAME_##N##_END = NAME_##N##_START + LEN + FLAG, > #include "nsHtml5NamedCharactersInclude.h" > #undef NAMED_CHARACTER_REFERENCE > DUMMY_FINAL_NAME_VALUE > }; > >-#define NAMED_CHARACTERS_COUNT 2138 >- > /* check that the start positions will fit in 16 bits */ > PR_STATIC_ASSERT(NS_ARRAY_LENGTH(ALL_NAMES) < 0x10000); > >-struct NamedCharacterData { >- PRUint16 nameStart; >- PRUint16 nameLen; >-#ifdef DEBUG >- PRInt32 n; >-#endif >-}; >- >-static const NamedCharacterData charData[NAMED_CHARACTERS_COUNT] = { >+const nsHtml5CharacterName nsHtml5NamedCharacters::NAMES[] = { > #ifdef DEBUG > #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \ > { NAME_##N##_START, LEN, N }, > #else > #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \ > { NAME_##N##_START, LEN, }, > #endif > #include "nsHtml5NamedCharactersInclude.h" > #undef NAMED_CHARACTER_REFERENCE > }; > >+PRInt32 >+nsHtml5CharacterName::length() const >+{ >+ return nameLen; >+} >+ >+PRUnichar >+nsHtml5CharacterName::charAt(PRInt32 index) const >+{ >+ return static_cast<PRUnichar> (ALL_NAMES[nameStart + index]); >+} >+ > void > nsHtml5NamedCharacters::initializeStatics() > { >- NAMES = jArray<jArray<PRInt8,PRInt32>,PRInt32>(NAMED_CHARACTERS_COUNT); >- PRInt8* allNames = const_cast<PRInt8*>(ALL_NAMES); >- for (PRInt32 i = 0; i < NAMED_CHARACTERS_COUNT; ++i) { >- const NamedCharacterData &data = charData[i]; >- NS_ABORT_IF_FALSE(data.n == i, >- "index error in nsHtml5NamedCharactersInclude.h"); >- NAMES[i] = jArray<PRInt8,PRInt32>(allNames + data.nameStart, data.nameLen); >- } >- > WINDOWS_1252 = new PRUnichar*[32]; > for (PRInt32 i = 0; i < 32; ++i) { > WINDOWS_1252[i] = (PRUnichar*)&(WINDOWS_1252_DATA[i]); > } > } > > void > nsHtml5NamedCharacters::releaseStatics() > { >- NAMES.release(); > delete[] WINDOWS_1252; > } >diff --git a/parser/html/nsHtml5NamedCharacters.h b/parser/html/nsHtml5NamedCharacters.h >--- a/parser/html/nsHtml5NamedCharacters.h >+++ b/parser/html/nsHtml5NamedCharacters.h >@@ -25,19 +25,29 @@ > > #include "prtypes.h" > #include "jArray.h" > #include "nscore.h" > #include "nsDebug.h" > #include "prlog.h" > #include "nsMemory.h" > >+struct nsHtml5CharacterName { >+ PRUint16 nameStart; >+ PRUint16 nameLen; >+ #ifdef DEBUG >+ PRInt32 n; >+ #endif >+ PRInt32 length() const; >+ PRUnichar charAt(PRInt32 index) const; >+}; >+ > class nsHtml5NamedCharacters > { > public: >- static jArray<jArray<PRInt8,PRInt32>,PRInt32> NAMES; >+ static const nsHtml5CharacterName NAMES[]; > static const PRUnichar VALUES[][2]; > static PRUnichar** WINDOWS_1252; > static void initializeStatics(); > static void releaseStatics(); > }; > > #endif // nsHtml5NamedCharacters_h_ >diff --git a/parser/html/nsHtml5Tokenizer.cpp b/parser/html/nsHtml5Tokenizer.cpp >--- a/parser/html/nsHtml5Tokenizer.cpp >+++ b/parser/html/nsHtml5Tokenizer.cpp >@@ -1505,39 +1505,39 @@ nsHtml5Tokenizer::stateLoop(PRInt32 stat > if (c == '\0') { > NS_HTML5_BREAK(stateloop); > } > entCol++; > for (; ; ) { > if (hi < lo) { > NS_HTML5_BREAK(outer); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[lo].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length) { >+ } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length()) { > NS_HTML5_BREAK(outer); >- } else if (c > nsHtml5NamedCharacters::NAMES[lo][entCol]) { >+ } else if (c > nsHtml5NamedCharacters::NAMES[lo].charAt(entCol)) { > lo++; > } else { > NS_HTML5_BREAK(loloop); > } > } > loloop_end: ; > for (; ; ) { > if (hi < lo) { > NS_HTML5_BREAK(outer); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(hiloop); > } >- if (entCol > nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol > nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(outer); >- } else if (c < nsHtml5NamedCharacters::NAMES[hi][entCol]) { >+ } else if (c < nsHtml5NamedCharacters::NAMES[hi].charAt(entCol)) { > hi--; > } else { > NS_HTML5_BREAK(hiloop); > } > } > hiloop_end: ; > if (hi < lo) { > NS_HTML5_BREAK(outer); >@@ -1551,18 +1551,18 @@ nsHtml5Tokenizer::stateLoop(PRInt32 stat > emitOrAppendStrBuf(returnState); > if (!(returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) { > cstart = pos; > } > state = returnState; > reconsume = PR_TRUE; > NS_HTML5_CONTINUE(stateloop); > } else { >- jArray<PRInt8,PRInt32> candidateArr = nsHtml5NamedCharacters::NAMES[candidate]; >- if (!candidateArr.length || candidateArr[candidateArr.length - 1] != ';') { >+ const nsHtml5CharacterName& candidateName = nsHtml5NamedCharacters::NAMES[candidate]; >+ if (!candidateName.length() || candidateName.charAt(candidateName.length() - 1) != ';') { > if ((returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) { > PRUnichar ch; > if (strBufMark == strBufLen) { > ch = c; > } else { > ch = strBuf[strBufMark]; > } > if (ch == '=' || (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) { >@@ -3595,39 +3595,39 @@ nsHtml5Tokenizer::eof() > case NS_HTML5TOKENIZER_CHARACTER_REFERENCE_TAIL: { > for (; ; ) { > PRUnichar c = '\0'; > entCol++; > for (; ; ) { > if (hi == -1) { > NS_HTML5_BREAK(hiloop); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(hiloop); > } >- if (entCol > nsHtml5NamedCharacters::NAMES[hi].length) { >+ if (entCol > nsHtml5NamedCharacters::NAMES[hi].length()) { > NS_HTML5_BREAK(outer); >- } else if (c < nsHtml5NamedCharacters::NAMES[hi][entCol]) { >+ } else if (c < nsHtml5NamedCharacters::NAMES[hi].charAt(entCol)) { > hi--; > } else { > NS_HTML5_BREAK(hiloop); > } > } > hiloop_end: ; > for (; ; ) { > if (hi < lo) { > NS_HTML5_BREAK(outer); > } >- if (entCol == nsHtml5NamedCharacters::NAMES[lo].length) { >+ if (entCol == nsHtml5NamedCharacters::NAMES[lo].length()) { > candidate = lo; > strBufMark = strBufLen; > lo++; >- } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length) { >+ } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length()) { > NS_HTML5_BREAK(outer); >- } else if (c > nsHtml5NamedCharacters::NAMES[lo][entCol]) { >+ } else if (c > nsHtml5NamedCharacters::NAMES[lo].charAt(entCol)) { > lo++; > } else { > NS_HTML5_BREAK(loloop); > } > } > loloop_end: ; > if (hi < lo) { > NS_HTML5_BREAK(outer); >@@ -3636,18 +3636,18 @@ nsHtml5Tokenizer::eof() > } > outer_end: ; > if (candidate == -1) { > > emitOrAppendStrBuf(returnState); > state = returnState; > NS_HTML5_CONTINUE(eofloop); > } else { >- jArray<PRInt8,PRInt32> candidateArr = nsHtml5NamedCharacters::NAMES[candidate]; >- if (!candidateArr.length || candidateArr[candidateArr.length - 1] != ';') { >+ const nsHtml5CharacterName& candidateName = nsHtml5NamedCharacters::NAMES[candidate]; >+ if (!candidateName.length() || candidateName.charAt(candidateName.length() - 1) != ';') { > if ((returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) { > PRUnichar ch; > if (strBufMark == strBufLen) { > ch = '\0'; > } else { > ch = strBuf[strBufMark]; > } > if ((ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) {
Attachment #481184 -
Flags: review?(tglek) → review+
Assignee | ||
Comment 4•14 years ago
|
||
Comment on attachment 481184 [details] [diff] [review] Make named character names use a dedicated type Requesting approval, because this patch together with the patch for bug 502176 would advance the goal of bug 569629, i.e. getting rid of static initializers on the startup path.
Attachment #481184 -
Flags: approval2.0?
Updated•14 years ago
|
Attachment #481184 -
Flags: approval2.0? → approval2.0+
Assignee | ||
Comment 5•14 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/9623c2032948
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 6•14 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/9623c2032948
Status: ASSIGNED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•