Closed Bug 555940 Opened 14 years ago Closed 14 years ago

[HTML5] Named character names should use Java String and a dedicated type in C++

Categories

(Core :: DOM: HTML Parser, defect, P4)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Attachments

(1 file)

Currently, the named character name table in the HTML5 parser uses byte[] in Java and static byte array-backed heap-allocated jArray objects in C++.

It would be desirable to use java.lang.String objects in Java (to avoid the added footprint of the byte arrays in Java and GWT-compiled JS) and to use a dedicated struct type that could respond to charAt() and length() in C++ (to avoid the heap-allocated jArray objects).
In C++, the length should probably be baked into the character buffer so that the named character struct looks roughly like this:

struct nsHtml5CharacterName {
  short mOffset;
  PRInt32 length() {
    return static_cast<PRInt32> (nsHtml5NamedCharacters::ALL_NAMES[mOffset]);
  }
  PRUnichar charAt(PRInt32 index) {
    return static_cast<PRUnichar> (nsHtml5NamedCharacters::ALL_NAMES[mOffset + 1 + index]);
  }
}
Chances are the patch won't apply to the trunk. You can get my entire queue from 
http://hg.mozilla.org/users/hsivonen_iki.fi/patches/file/80f46a1077ac
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Attachment #481184 - Flags: review?(tglek)
Comment on attachment 481184 [details] [diff] [review]
Make named character names use a dedicated type

># HG changeset patch
># User Henri Sivonen <hsivonen@iki.fi>
># Date 1285659151 -10800
># Node ID f3ff660a74c470eb5271b495f84e695f7683893a
># Parent  c987796ba739eca3bb573d9621d451a49ed8cf50
>Bug 555940 - Make named character names use a dedicated data type. r=NOT_REVIEWED, a=NOT_APPROVED.
>
>diff --git a/parser/html/javasrc/Tokenizer.java b/parser/html/javasrc/Tokenizer.java
>--- a/parser/html/javasrc/Tokenizer.java
>+++ b/parser/html/javasrc/Tokenizer.java
>@@ -31,16 +31,17 @@
>  * "© Copyright 2004-2010 Apple Computer, Inc., Mozilla Foundation, and 
>  * Opera Software ASA. You are granted a license to use, reproduce and 
>  * create derivative works of this document."
>  */
> 
> package nu.validator.htmlparser.impl;
> 
> import nu.validator.htmlparser.annotation.Auto;
>+import nu.validator.htmlparser.annotation.CharacterName;
> import nu.validator.htmlparser.annotation.Const;
> import nu.validator.htmlparser.annotation.Inline;
> import nu.validator.htmlparser.annotation.Local;
> import nu.validator.htmlparser.annotation.NoLength;
> import nu.validator.htmlparser.common.EncodingDeclarationHandler;
> import nu.validator.htmlparser.common.Interner;
> import nu.validator.htmlparser.common.TokenHandler;
> import nu.validator.htmlparser.common.XmlViolationPolicy;
>@@ -3105,39 +3106,39 @@ public class Tokenizer implements Locato
>                          * identifiers in the first column of the named
>                          * character references table (in a case-sensitive
>                          * manner).
>                          */
>                         loloop: for (;;) {
>                             if (hi < lo) {
>                                 break outer;
>                             }
>-                            if (entCol == NamedCharacters.NAMES[lo].length) {
>+                            if (entCol == NamedCharacters.NAMES[lo].length()) {
>                                 candidate = lo;
>                                 strBufMark = strBufLen;
>                                 lo++;
>-                            } else if (entCol > NamedCharacters.NAMES[lo].length) {
>+                            } else if (entCol > NamedCharacters.NAMES[lo].length()) {
>                                 break outer;
>-                            } else if (c > NamedCharacters.NAMES[lo][entCol]) {
>+                            } else if (c > NamedCharacters.NAMES[lo].charAt(entCol)) {
>                                 lo++;
>                             } else {
>                                 break loloop;
>                             }
>                         }
> 
>                         hiloop: for (;;) {
>                             if (hi < lo) {
>                                 break outer;
>                             }
>-                            if (entCol == NamedCharacters.NAMES[hi].length) {
>+                            if (entCol == NamedCharacters.NAMES[hi].length()) {
>                                 break hiloop;
>                             }
>-                            if (entCol > NamedCharacters.NAMES[hi].length) {
>+                            if (entCol > NamedCharacters.NAMES[hi].length()) {
>                                 break outer;
>-                            } else if (c < NamedCharacters.NAMES[hi][entCol]) {
>+                            } else if (c < NamedCharacters.NAMES[hi].charAt(entCol)) {
>                                 hi--;
>                             } else {
>                                 break hiloop;
>                             }
>                         }
> 
>                         if (hi < lo) {
>                             break outer;
>@@ -3157,19 +3158,19 @@ public class Tokenizer implements Locato
>                         if ((returnState & DATA_AND_RCDATA_MASK) == 0) {
>                             cstart = pos;
>                         }
>                         state = transition(state, returnState, reconsume, pos);
>                         reconsume = true;
>                         continue stateloop;
>                     } else {
>                         // c can't be CR, LF or nul if we got here
>-                        byte[] candidateArr = NamedCharacters.NAMES[candidate];
>-                        if (candidateArr.length == 0
>-                                || candidateArr[candidateArr.length - 1] != ';') {
>+                        @Const @CharacterName String candidateName = NamedCharacters.NAMES[candidate];
>+                        if (candidateName.length() == 0
>+                                || candidateName.charAt(candidateName.length() - 1) != ';') {
>                             /*
>                              * If the last character matched is not a U+003B
>                              * SEMICOLON (;), there is a parse error.
>                              */
>                             if ((returnState & DATA_AND_RCDATA_MASK) != 0) {
>                                 /*
>                                  * If the entity is being consumed as part of an
>                                  * attribute, and the last character matched is
>@@ -6320,63 +6321,62 @@ public class Tokenizer implements Locato
>                          * identifiers in the first column of the named
>                          * character references table (in a case-sensitive
>                          * manner).
>                          */
>                         hiloop: for (;;) {
>                             if (hi == -1) {
>                                 break hiloop;
>                             }
>-                            if (entCol == NamedCharacters.NAMES[hi].length) {
>+                            if (entCol == NamedCharacters.NAMES[hi].length()) {
>                                 break hiloop;
>                             }
>-                            if (entCol > NamedCharacters.NAMES[hi].length) {
>+                            if (entCol > NamedCharacters.NAMES[hi].length()) {
>                                 break outer;
>-                            } else if (c < NamedCharacters.NAMES[hi][entCol]) {
>+                            } else if (c < NamedCharacters.NAMES[hi].charAt(entCol)) {
>                                 hi--;
>                             } else {
>                                 break hiloop;
>                             }
>                         }
> 
>                         loloop: for (;;) {
>                             if (hi < lo) {
>                                 break outer;
>                             }
>-                            if (entCol == NamedCharacters.NAMES[lo].length) {
>+                            if (entCol == NamedCharacters.NAMES[lo].length()) {
>                                 candidate = lo;
>                                 strBufMark = strBufLen;
>                                 lo++;
>-                            } else if (entCol > NamedCharacters.NAMES[lo].length) {
>+                            } else if (entCol > NamedCharacters.NAMES[lo].length()) {
>                                 break outer;
>-                            } else if (c > NamedCharacters.NAMES[lo][entCol]) {
>+                            } else if (c > NamedCharacters.NAMES[lo].charAt(entCol)) {
>                                 lo++;
>                             } else {
>                                 break loloop;
>                             }
>                         }
>                         if (hi < lo) {
>                             break outer;
>                         }
>                         continue;
>                     }
> 
>-                    // TODO warn about apos (IE) and TRADE (Opera)
>                     if (candidate == -1) {
>                         /*
>                          * If no match can be made, then this is a parse error.
>                          */
>                         errNoNamedCharacterMatch();
>                         emitOrAppendStrBuf(returnState);
>                         state = returnState;
>                         continue eofloop;
>                     } else {
>-                        byte[] candidateArr = NamedCharacters.NAMES[candidate];
>-                        if (candidateArr.length == 0
>-                                || candidateArr[candidateArr.length - 1] != ';') {
>+                        @Const @CharacterName String candidateName = NamedCharacters.NAMES[candidate];
>+                        if (candidateName.length() == 0
>+                                || candidateName.charAt(candidateName.length() - 1) != ';') {
>                             /*
>                              * If the last character matched is not a U+003B
>                              * SEMICOLON (;), there is a parse error.
>                              */
>                             if ((returnState & DATA_AND_RCDATA_MASK) != 0) {
>                                 /*
>                                  * If the entity is being consumed as part of an
>                                  * attribute, and the last character matched is
>diff --git a/parser/html/nsHtml5NamedCharacters.cpp b/parser/html/nsHtml5NamedCharacters.cpp
>--- a/parser/html/nsHtml5NamedCharacters.cpp
>+++ b/parser/html/nsHtml5NamedCharacters.cpp
>@@ -25,18 +25,16 @@
> #include "jArray.h"
> #include "nscore.h"
> #include "nsDebug.h"
> #include "prlog.h"
> #include "nsMemory.h"
> 
> #include "nsHtml5NamedCharacters.h"
> 
>-jArray<jArray<PRInt8,PRInt32>,PRInt32> nsHtml5NamedCharacters::NAMES;
>-
> const PRUnichar nsHtml5NamedCharacters::VALUES[][2] = {
> #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \
> { VALUE },
> #include "nsHtml5NamedCharactersInclude.h"
> #undef NAMED_CHARACTER_REFERENCE
> {0, 0} };
> 
> PRUnichar** nsHtml5NamedCharacters::WINDOWS_1252;
>@@ -97,57 +95,49 @@ enum NamePositions {
> NAME_##N##_DUMMY, /* automatically one higher than previous */ \
> NAME_##N##_START = NAME_##N##_DUMMY - 1, \
> NAME_##N##_END = NAME_##N##_START + LEN + FLAG,
> #include "nsHtml5NamedCharactersInclude.h"
> #undef NAMED_CHARACTER_REFERENCE
>   DUMMY_FINAL_NAME_VALUE
> };
> 
>-#define NAMED_CHARACTERS_COUNT 2138
>-
> /* check that the start positions will fit in 16 bits */
> PR_STATIC_ASSERT(NS_ARRAY_LENGTH(ALL_NAMES) < 0x10000);
> 
>-struct NamedCharacterData {
>-  PRUint16 nameStart;
>-  PRUint16 nameLen;
>-#ifdef DEBUG
>-  PRInt32 n;
>-#endif
>-};
>-
>-static const NamedCharacterData charData[NAMED_CHARACTERS_COUNT] = {
>+const nsHtml5CharacterName nsHtml5NamedCharacters::NAMES[] = {
> #ifdef DEBUG
>   #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \
> { NAME_##N##_START, LEN, N },
> #else
>   #define NAMED_CHARACTER_REFERENCE(N, CHARS, LEN, FLAG, VALUE) \
> { NAME_##N##_START, LEN, },
> #endif
> #include "nsHtml5NamedCharactersInclude.h"
> #undef NAMED_CHARACTER_REFERENCE
> };
> 
>+PRInt32
>+nsHtml5CharacterName::length() const
>+{
>+  return nameLen;
>+}
>+
>+PRUnichar
>+nsHtml5CharacterName::charAt(PRInt32 index) const
>+{
>+  return static_cast<PRUnichar> (ALL_NAMES[nameStart + index]);
>+}
>+
> void
> nsHtml5NamedCharacters::initializeStatics()
> {
>-  NAMES = jArray<jArray<PRInt8,PRInt32>,PRInt32>(NAMED_CHARACTERS_COUNT);
>-  PRInt8* allNames = const_cast<PRInt8*>(ALL_NAMES);
>-  for (PRInt32 i = 0; i < NAMED_CHARACTERS_COUNT; ++i) {
>-    const NamedCharacterData &data = charData[i];
>-    NS_ABORT_IF_FALSE(data.n == i,
>-                      "index error in nsHtml5NamedCharactersInclude.h");
>-    NAMES[i] = jArray<PRInt8,PRInt32>(allNames + data.nameStart, data.nameLen);
>-  }
>-
>   WINDOWS_1252 = new PRUnichar*[32];
>   for (PRInt32 i = 0; i < 32; ++i) {
>     WINDOWS_1252[i] = (PRUnichar*)&(WINDOWS_1252_DATA[i]);
>   }
> }
> 
> void
> nsHtml5NamedCharacters::releaseStatics()
> {
>-  NAMES.release();
>   delete[] WINDOWS_1252;
> }
>diff --git a/parser/html/nsHtml5NamedCharacters.h b/parser/html/nsHtml5NamedCharacters.h
>--- a/parser/html/nsHtml5NamedCharacters.h
>+++ b/parser/html/nsHtml5NamedCharacters.h
>@@ -25,19 +25,29 @@
> 
> #include "prtypes.h"
> #include "jArray.h"
> #include "nscore.h"
> #include "nsDebug.h"
> #include "prlog.h"
> #include "nsMemory.h"
> 
>+struct nsHtml5CharacterName {
>+  PRUint16 nameStart;
>+  PRUint16 nameLen;
>+  #ifdef DEBUG
>+  PRInt32 n;
>+  #endif
>+  PRInt32 length() const;
>+  PRUnichar charAt(PRInt32 index) const;
>+};
>+
> class nsHtml5NamedCharacters
> {
>   public:
>-    static jArray<jArray<PRInt8,PRInt32>,PRInt32> NAMES;
>+    static const nsHtml5CharacterName NAMES[];
>     static const PRUnichar VALUES[][2];
>     static PRUnichar** WINDOWS_1252;
>     static void initializeStatics();
>     static void releaseStatics();
> };
> 
> #endif // nsHtml5NamedCharacters_h_
>diff --git a/parser/html/nsHtml5Tokenizer.cpp b/parser/html/nsHtml5Tokenizer.cpp
>--- a/parser/html/nsHtml5Tokenizer.cpp
>+++ b/parser/html/nsHtml5Tokenizer.cpp
>@@ -1505,39 +1505,39 @@ nsHtml5Tokenizer::stateLoop(PRInt32 stat
>           if (c == '\0') {
>             NS_HTML5_BREAK(stateloop);
>           }
>           entCol++;
>           for (; ; ) {
>             if (hi < lo) {
>               NS_HTML5_BREAK(outer);
>             }
>-            if (entCol == nsHtml5NamedCharacters::NAMES[lo].length) {
>+            if (entCol == nsHtml5NamedCharacters::NAMES[lo].length()) {
>               candidate = lo;
>               strBufMark = strBufLen;
>               lo++;
>-            } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length) {
>+            } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length()) {
>               NS_HTML5_BREAK(outer);
>-            } else if (c > nsHtml5NamedCharacters::NAMES[lo][entCol]) {
>+            } else if (c > nsHtml5NamedCharacters::NAMES[lo].charAt(entCol)) {
>               lo++;
>             } else {
>               NS_HTML5_BREAK(loloop);
>             }
>           }
>           loloop_end: ;
>           for (; ; ) {
>             if (hi < lo) {
>               NS_HTML5_BREAK(outer);
>             }
>-            if (entCol == nsHtml5NamedCharacters::NAMES[hi].length) {
>+            if (entCol == nsHtml5NamedCharacters::NAMES[hi].length()) {
>               NS_HTML5_BREAK(hiloop);
>             }
>-            if (entCol > nsHtml5NamedCharacters::NAMES[hi].length) {
>+            if (entCol > nsHtml5NamedCharacters::NAMES[hi].length()) {
>               NS_HTML5_BREAK(outer);
>-            } else if (c < nsHtml5NamedCharacters::NAMES[hi][entCol]) {
>+            } else if (c < nsHtml5NamedCharacters::NAMES[hi].charAt(entCol)) {
>               hi--;
>             } else {
>               NS_HTML5_BREAK(hiloop);
>             }
>           }
>           hiloop_end: ;
>           if (hi < lo) {
>             NS_HTML5_BREAK(outer);
>@@ -1551,18 +1551,18 @@ nsHtml5Tokenizer::stateLoop(PRInt32 stat
>           emitOrAppendStrBuf(returnState);
>           if (!(returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) {
>             cstart = pos;
>           }
>           state = returnState;
>           reconsume = PR_TRUE;
>           NS_HTML5_CONTINUE(stateloop);
>         } else {
>-          jArray<PRInt8,PRInt32> candidateArr = nsHtml5NamedCharacters::NAMES[candidate];
>-          if (!candidateArr.length || candidateArr[candidateArr.length - 1] != ';') {
>+          const nsHtml5CharacterName& candidateName = nsHtml5NamedCharacters::NAMES[candidate];
>+          if (!candidateName.length() || candidateName.charAt(candidateName.length() - 1) != ';') {
>             if ((returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) {
>               PRUnichar ch;
>               if (strBufMark == strBufLen) {
>                 ch = c;
>               } else {
>                 ch = strBuf[strBufMark];
>               }
>               if (ch == '=' || (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) {
>@@ -3595,39 +3595,39 @@ nsHtml5Tokenizer::eof()
>       case NS_HTML5TOKENIZER_CHARACTER_REFERENCE_TAIL: {
>         for (; ; ) {
>           PRUnichar c = '\0';
>           entCol++;
>           for (; ; ) {
>             if (hi == -1) {
>               NS_HTML5_BREAK(hiloop);
>             }
>-            if (entCol == nsHtml5NamedCharacters::NAMES[hi].length) {
>+            if (entCol == nsHtml5NamedCharacters::NAMES[hi].length()) {
>               NS_HTML5_BREAK(hiloop);
>             }
>-            if (entCol > nsHtml5NamedCharacters::NAMES[hi].length) {
>+            if (entCol > nsHtml5NamedCharacters::NAMES[hi].length()) {
>               NS_HTML5_BREAK(outer);
>-            } else if (c < nsHtml5NamedCharacters::NAMES[hi][entCol]) {
>+            } else if (c < nsHtml5NamedCharacters::NAMES[hi].charAt(entCol)) {
>               hi--;
>             } else {
>               NS_HTML5_BREAK(hiloop);
>             }
>           }
>           hiloop_end: ;
>           for (; ; ) {
>             if (hi < lo) {
>               NS_HTML5_BREAK(outer);
>             }
>-            if (entCol == nsHtml5NamedCharacters::NAMES[lo].length) {
>+            if (entCol == nsHtml5NamedCharacters::NAMES[lo].length()) {
>               candidate = lo;
>               strBufMark = strBufLen;
>               lo++;
>-            } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length) {
>+            } else if (entCol > nsHtml5NamedCharacters::NAMES[lo].length()) {
>               NS_HTML5_BREAK(outer);
>-            } else if (c > nsHtml5NamedCharacters::NAMES[lo][entCol]) {
>+            } else if (c > nsHtml5NamedCharacters::NAMES[lo].charAt(entCol)) {
>               lo++;
>             } else {
>               NS_HTML5_BREAK(loloop);
>             }
>           }
>           loloop_end: ;
>           if (hi < lo) {
>             NS_HTML5_BREAK(outer);
>@@ -3636,18 +3636,18 @@ nsHtml5Tokenizer::eof()
>         }
>         outer_end: ;
>         if (candidate == -1) {
> 
>           emitOrAppendStrBuf(returnState);
>           state = returnState;
>           NS_HTML5_CONTINUE(eofloop);
>         } else {
>-          jArray<PRInt8,PRInt32> candidateArr = nsHtml5NamedCharacters::NAMES[candidate];
>-          if (!candidateArr.length || candidateArr[candidateArr.length - 1] != ';') {
>+          const nsHtml5CharacterName& candidateName = nsHtml5NamedCharacters::NAMES[candidate];
>+          if (!candidateName.length() || candidateName.charAt(candidateName.length() - 1) != ';') {
>             if ((returnState & NS_HTML5TOKENIZER_DATA_AND_RCDATA_MASK)) {
>               PRUnichar ch;
>               if (strBufMark == strBufLen) {
>                 ch = '\0';
>               } else {
>                 ch = strBuf[strBufMark];
>               }
>               if ((ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) {
Attachment #481184 - Flags: review?(tglek) → review+
Blocks: 502176
Comment on attachment 481184 [details] [diff] [review]
Make named character names use a dedicated type

Requesting approval, because this patch together with the patch for bug 502176 would advance the goal of bug 569629, i.e. getting rid of static initializers on the startup path.
Attachment #481184 - Flags: approval2.0?
Attachment #481184 - Flags: approval2.0? → approval2.0+
http://hg.mozilla.org/mozilla-central/rev/9623c2032948
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
http://hg.mozilla.org/mozilla-central/rev/9623c2032948
Status: ASSIGNED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: