Bug 487949 (html5-parsing-land)

[HTML5] Land HTML5 parser on m-c preffed off

RESOLVED FIXED

Status

()

Core
HTML: Parser
P1
normal
RESOLVED FIXED
9 years ago
6 years ago

People

(Reporter: hsivonen, Assigned: hsivonen)

Tracking

(Blocks: 2 bugs)

Trunk
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.2 -
wanted1.9.2 -

Firefox Tracking Flags

(status1.9.2 beta1-fixed)

Details

Attachments

(6 attachments, 11 obsolete attachments)

18.66 KB, patch
Details | Diff | Splinter Review
20.21 KB, patch
Details | Diff | Splinter Review
8.42 KB, patch
Details | Diff | Splinter Review
1.43 MB, patch
Details | Diff | Splinter Review
1.30 MB, patch
Details | Diff | Splinter Review
1.56 KB, patch
Details | Diff | Splinter Review
(Assignee)

Description

9 years ago
I think the HTML5 parser now fulfills the criteria that was discussed on Monday for landing it on mozilla-central preffed off. (That is, the dogfood blocker crashes are gone.)

Bug 468708 isn't preffable and needs to land first as discussed previously.

The plan is still to merge the history from the HTML5 parsing repo as opposed to just landing one big patch, right?

Who should I ask to r and sr?

Should the snapshot of the Java source corresponding to the C++ version of the parser be checked into some corner of mozilla-central to make sure that the 'preferred form for making modifications' travels with the C++ code?
We definitely want to check in the java source that corresponds to the C++ code that is checked in. The Java source is what according to GPL is considered 'source code'.

Does the C++ code contain the comments that is in the java source? We still haven't really resolved the whole issue of if we want to eventually be modifying the C++ code, or if we're planning on keeping the java source around forever.

As for reviews. I'd say mrbkap and me at least. Possibly more people depending on the actual diff.

I'll try to find you on irc so we can figure out the issue of merging or having a single patch committed.
(Assignee)

Comment 2

9 years ago
(In reply to comment #1)
> We definitely want to check in the java source that corresponds to the C++ code
> that is checked in. The Java source is what according to GPL is considered
> 'source code'.

OK. Do you have a suggestion for the directory location in the tree? Would it be ok to put it as a sibling to the parser src/ as java/ or somesuch?

> Does the C++ code contain the comments that is in the java source?

It doesn't. Copying over javadocs would be easy. Copying over inline comments would be painful, since the current AST for the Java files doesn't retain them.

> We still haven't really resolved the whole issue of if we want to eventually be
> modifying the C++ code, or if we're planning on keeping the java source around
> forever.

I suggest keeping the Java source around.

> As for reviews. I'd say mrbkap and me at least. Possibly more people depending
> on the actual diff.

Should there be a bugzilla-attached diff for a landing of this size or would reviewers read the source from hg?
(Assignee)

Comment 3

9 years ago
Created attachment 373077 [details] [diff] [review]
Patch as of rev 2770407be167 as if bug 468708 had already landed part 1

Attaching part one of a huge patch to have something to set review flags on. This patch doesn't contain the Java code, since its location in the tree is undecided.
Attachment #373077 - Flags: superreview?(jonas)
Attachment #373077 - Flags: review?(mrbkap)
(Assignee)

Comment 4

9 years ago
Created attachment 373078 [details] [diff] [review]
 Patch as of rev 2770407be167 as if bug 468708 had already landed part 2

Bugzilla didn't allow all this in one attachment...
Attachment #373078 - Flags: superreview?(jonas)
Attachment #373078 - Flags: review?(mrbkap)
(Assignee)

Comment 5

9 years ago
Note to self:
Patch generated with hg diff -r 2199410bc4bb:tip
Priority: -- → P1
Applied all three patches (including the one for bug 468708), and, after a little tweaking, it works!  At first glance, the 468708 patch seems to have removed a couple of method prototypes that the big patch assumes:

diff --git a/content/html/content/src/nsGenericHTMLElement.h b/content/html/content/src/nsGenericHTMLElement.h
--- a/content/html/content/src/nsGenericHTMLElement.h
+++ b/content/html/content/src/nsGenericHTMLElement.h
@@ -98,23 +98,27 @@ public:
   nsresult DOMQueryInterface(nsIDOMHTMLElement *aElement, REFNSIID aIID,
                              void **aInstancePtr);
 
   // From nsGenericElement
   nsresult CopyInnerTo(nsGenericElement* aDest) const;
 
   // Implementation for nsIDOMNode
   NS_METHOD GetNodeName(nsAString& aNodeName);
+  NS_METHOD GetLocalName(nsAString& aLocalName);
 
   // Implementation for nsIDOMElement
   NS_METHOD SetAttribute(const nsAString& aName,
                          const nsAString& aValue);
   NS_METHOD GetTagName(nsAString& aTagName);
   NS_METHOD GetElementsByTagName(const nsAString& aTagname,
                                  nsIDOMNodeList** aReturn);
+  NS_METHOD GetElementsByTagNameNS(const nsAString& aNamespaceURI,
+                                   const nsAString& aLocalName,
+                                   nsIDOMNodeList** aReturn);
 
   // nsIDOMHTMLElement methods. Note that these are non-virtual
   // methods, implementations are expected to forward calls to these
   // methods.
   nsresult GetId(nsAString& aId);
   nsresult SetId(const nsAString& aId);
   nsresult GetTitle(nsAString& aTitle);
   nsresult SetTitle(const nsAString& aTitle);
diff --git a/content/html/document/src/nsHTMLDocument.h b/content/html/document/src/nsHTMLDocument.h
--- a/content/html/document/src/nsHTMLDocument.h
+++ b/content/html/document/src/nsHTMLDocument.h
@@ -179,16 +179,23 @@ public:
   virtual PRInt32 GetNumFormsSynchronous();
   virtual void TearingDownEditor(nsIEditor *aEditor);
   virtual void SetIsXHTML(PRBool aXHTML) { mIsRegularHTML = !aXHTML; }
   PRBool IsXHTML()
   {
     return !mIsRegularHTML;
   }
 
+#ifdef DEBUG
+  virtual nsresult CreateElem(nsIAtom *aName, nsIAtom *aPrefix,
+                              PRInt32 aNamespaceID,
+                              PRBool aDocumentDefaultType,
+                              nsIContent** aResult);
+#endif
+
   nsresult ChangeContentEditableCount(nsIContent *aElement, PRInt32 aChange);
 
   virtual EditingState GetEditingState()
   {
     return mEditingState;
   }
 
   virtual void DisableCookieAccess()
(In reply to comment #3)
> Created an attachment (id=373077) [details]
> Patch as of rev 2770407be167 as if bug 468708 had already landed part 1
> 
> Attaching part one of a huge patch to have something to set review flags on.
> This patch doesn't contain the Java code, since its location in the tree is
> undecided.

Any arguments against parser/html5/?  Seems roughly analogous to parser/expat/.

Also, I didn't notice the java sources in http://hg.mozilla.org/users/mrbkap_mozilla.com/html5parsing, though I might just be overlooking them.  Is there a current public version available?
(In reply to comment #6)
> At first glance, the 468708 patch seems to have
> removed a couple of method prototypes that the big patch assumes:

Disregard that.  I see it's fixed with http://hg.mozilla.org/users/mrbkap_mozilla.com/html5parsing/rev/70bf90b92bf1
(Assignee)

Comment 9

9 years ago
(In reply to comment #7)
> Any arguments against parser/html5/?  Seems roughly analogous to parser/expat/.

The reason why the parser lives under content/ is that it needed to link against stuff that is readily visible under content/ but a constant pain under parser/. Furthermore, I lacked the right build system clue to make parser/ see everything content/ sees and link into the same .so as content/.

Initially, I tried to set things up under parser/.

> Also, I didn't notice the java sources in
> http://hg.mozilla.org/users/mrbkap_mozilla.com/html5parsing, though I might
> just be overlooking them.  Is there a current public version available?

The Java code is available for svn checkout from http://svn.versiondude.net/whattf/htmlparser/trunk/

The patches on this bug are already stale. It seems to me that maintaining bugzilla patches that are larger than bugzilla's limit is inconvenient. Should I just attach a small placeholder file to set review flag on and let reviewers signal at which hg rev they started reviewing?
(Assignee)

Updated

9 years ago
Attachment #373077 - Attachment is obsolete: true
Attachment #373077 - Flags: superreview?(jonas)
Attachment #373077 - Flags: review?(mrbkap)
(Assignee)

Updated

9 years ago
Attachment #373078 - Attachment is obsolete: true
Attachment #373078 - Flags: superreview?(jonas)
Attachment #373078 - Flags: review?(mrbkap)
(In reply to comment #9)
> The patches on this bug are already stale. It seems to me that maintaining
> bugzilla patches that are larger than bugzilla's limit is inconvenient. Should
> I just attach a small placeholder file to set review flag on and let reviewers
> signal at which hg rev they started reviewing?

How big do they need to be?
(In reply to comment #10)
> How big do they need to be?

I mean, how big are the attachments you're trying to attach?
(Assignee)

Comment 12

9 years ago
(In reply to comment #11)
> (In reply to comment #10)
> > How big do they need to be?
> 
> I mean, how big are the attachments you're trying to attach?

Currently over 1.5 MB but under 2.0 MB. If a future patch contains all the Java code, too, the total size probably goes to somewhere around 4 MB to 5 MB.

Do reviewers prefer huge patch attachments or version control pointers?
(Assignee)

Comment 13

9 years ago
Created attachment 374272 [details] [diff] [review]
The HTML5 parsing diff as if the DOM namespace patch had already landed
Attachment #374272 - Flags: superreview?(jonas)
Attachment #374272 - Flags: review?(mrbkap)
(Assignee)

Comment 14

9 years ago
Created attachment 375271 [details] [diff] [review]
The HTML5 parsing diff as if the DOM namespace patch had already landed

Now with more care for elements that need special treatment when inserted to content tree by the parser.
Attachment #374272 - Attachment is obsolete: true
Attachment #375271 - Flags: superreview?(jonas)
Attachment #375271 - Flags: review?(mrbkap)
Attachment #374272 - Flags: superreview?(jonas)
Attachment #374272 - Flags: review?(mrbkap)
(Assignee)

Comment 15

9 years ago
Created attachment 381762 [details] [diff] [review]
HTML5 parsing patch, now with fix for bug 483158

New patch with major tokenizer loop refactoring for bug 483158.

This patch is against trunk. I deliberately haven't bothered syncing this with the latest iteration of bug 468708, since it seems more useful to land it first and then just sync the HTML5 repo with trunk.
Attachment #375271 - Attachment is obsolete: true
Attachment #381762 - Flags: superreview?(jonas)
Attachment #381762 - Flags: review?(mrbkap)
Attachment #375271 - Flags: superreview?(jonas)
Attachment #375271 - Flags: review?(mrbkap)
(In reply to comment #15)
> This patch is against trunk. I deliberately haven't bothered syncing this with
> the latest iteration of bug 468708, since it seems more useful to land it first
> and then just sync the HTML5 repo with trunk.

Wholly agreed.
Seems like too many new and delete calls, also PRUnichar read/write arrays instead of const arrays. Fixable in the translator? We really need to avoid going to the heap for what should be LIFO and static storage.

Anyone have a clue about code footprint and pageload performance?

/be
roc argues we may not see enough new/delete traffic to matter; he could be right. My comment 17 was based on direct experience with similar code in Gecko and JS. But it may be different this time -- so measurement of static and dynamic footprint, and Tp and other metrics, is the important thing.

Looks like good code, attractive to have in a canonical Java form in the long run: perhaps some day we could translate to managed "JS++" or "C--" to get real memory safety and other properties.

Do we really need the ns prefix nowadays?

/be
(In reply to comment #18)
> Do we really need the ns prefix nowadays?

Specifically, cjones has been using the mozilla C++ namespace to good effect. Why not do the same here?

/be
(Assignee)

Comment 20

9 years ago
(In reply to comment #17)
> Seems like too many new and delete calls,

New/delete is hit mostly for the nsStrings for each attribute value and for each stack node in the tree builder.

The stack nodes can't be heap-allocated, because the call stack goes away and the tree builder stack stays.

Attribute values once they get into the attribute holder can't be heap-allocated for the same reason. However, in principle, the nsStrings that point to the backing buffers should be stack-allocated before the string ends up in the attribute holder and the attribute value array in the attribute holder could be a Gecko-specific string array.

That they aren't has been recognized as an issue even prior to any implementation work. However, attempts to avoid hitting new/delete for nsString have been deferred in order to get the whole thing otherwise working.

The main issue is that the Java code assumes it can return a string as the real return value of a method. cjones and I speculated on what would happen if one tried to return an nsString by value when you know for sure the backing buffer has no other references. Maybe the simplest solution is there. I have tried to avoid getting carried away with implementing more complex transformations in the Java-to-C++ translator than the transformations that are required to get the whole thing running and passing tests first.

As for the appearance of hitting new/delete per tag token or attribute name, that code path is taken only for element and attribute names that aren't pre-interned. All well-known names are pre-interned. Thus, optimizing away new/delete in the not-well-know name case didn't seem worthwhile at this point.

The UTF-16 buffer wrappers need to stay around when the call stack goes away, so they are heap-allocated. However, the IO driver doesn't delete the last buffer but recycles it.

> also PRUnichar read/write arrays instead of const arrays. Fixable in the translator?

I'm not using const in the generated code at all, because I couldn't get a confirmation that either GCC or MS Visual C++ puts const stuff on an actual locked memory page. It didn't seem useful to have const correctness only for static checking of the generated code without confirmed effects on memory segmentation.

It seems to me that making the translator infer constness wouldn't be feasible, but it would be feasible to annotate stuff manually with @Const and have it translate to const if constness is desired.

> We really need to avoid going to the heap for what should be LIFO and static storage.

A lot of the stuff that is conceptually static is heap-allocated at app startup, because the advice I got was that it wouldn't have been practical to try to get the compiler arrange the right structures at compile time.

> Anyone have a clue about code footprint and pageload performance?

No clue about code footprint. My local benchmarking says pageload should average to being a bit better than trunk, but individual pages can be a bit worse. Actual Talos-run Tp is broken, which it my top concern now aside from reviews.

(In reply to comment #18)
> roc argues we may not see enough new/delete traffic to matter; he could be
> right. My comment 17 was based on direct experience with similar code in Gecko
> and JS.

Making a guess based on Shark profiling, I guess that there's more to win at first by tweaking how exactly notifications get flushed. The time spent in frame construction in response to notifications is very significant compared to actual parsing. In fact, even the act of appending to the DOM before notifying shows up on profiles.

Also, another potential easy perf win would be bug 482920.

> But it may be different this time -- so measurement of static and
> dynamic footprint, and Tp and other metrics, is the important thing.

Yeah, I'm trying to work out the Tp problem, but it's hard to guess what goes wrong in the black box when locally things are fine.

> Looks like good code, attractive to have in a canonical Java form in the long
> run: perhaps some day we could translate to managed "JS++" or "C--" to get real
> memory safety and other properties.

Translating to "JS++" in the future seems feasible. (The parser core already translates to unreadable JS using GWT.)

(In reply to comment #19)
> (In reply to comment #18)
> > Do we really need the ns prefix nowadays?
> 
> Specifically, cjones has been using the mozilla C++ namespace to good effect.
> Why not do the same here?

I wrote the part that generated the nsHtml5 prefixes before the mozilla C++ namespace discussion took place. It's possible to change this. Should I pursue changing this as a gating factor to landing?

- -

Further, I noticed that this was discussed on IRC while it was night time in Finland, so I'll address the points from IRC, too:

> < roc> a reasonable question is how debuggable the generated code setup is

The generated code works fine in gdb. The structure is the same as the structure of the Java code, so it's super-easy to locate the original source line when you are ready to fix.

I want to do some AST rewriting to make the tokenizer use more goto and less switch, but the result should still be obviously recognizable comparing it to the Java code.

> < mrbkap> fwiw, hsivonen did a bunch of work to make the output of the 
>               generator readable and otherwise consumable by normal human 
>               beings.

Not really. The human-readable structure of the C++ code came out of the translation method "for free".

What could be improved would be carrying over Javadocs (easy) and carrying over /* */ comments that are between statements (requires hacking the Java-to-AST parser). The #1 reason I haven't done this is that I've tried hard to focus on the post-compilation functional qualities of the C++ version instead of getting carried away with polishing the translator.

Carrying over // comments would be a pain, so it would make more sense to convert those to /* */ between statements if there are any we really want to carry over.

> < mrbkap> And he's been dealing with and fixing bugs in the generated 
                code for at least 6 months now.

I've been dealing with bugs in the hand-written Gecko-specific code and somewhat in hand-written parser core code that's only relevant to memory management in C++. The translation step hasn't introduced any bugs to speak of.

> < stuart> mrbkap: wouldn't that time be better spent fixing the parser? :/

The translator isn't a time sink. The translated code isn't a time sink.

The top two time consumers have been:
 1) Tryserver Mac builds having utterly bogus symbols that lead me astray when there were crashes.
 2) Crashes that I could have avoided if I had opted to generate automatic refcounting for tree builder stack nodes instead of doing it manually. On surface it seemed so simple that I naively decided on doing it manually. (The issue of managing this memory would have come up even in a from-scratch C++ implementation, so I think this can't be blamed on the translation.)
 3) Various Gecko integration issues that have nothing to do with where the tokenizer and abstract tree builder came from.

The translator itself has been pretty easy and fast to write and it gives the tokenizer and tree builder in C++ "for free" since the Java effort was already sunk cost. I think time-wise the translator has been a win and the issues have been in code that I'd have had to write anyway had I written the whole thing from scratch in C++.

> 06:40 < brendan> hand-coding a C++ parser based on Hixie's spec, when henri is 
>                 maintaining the java impl and using it to validate, looks like 
>                 a high cost too
> 06:40 < brendan> plausibly higher than maintaining the translator, but who 
>                 knows?

I'd say maintaining two parser would be a higher cost than maintaining the translator. The translator was really much easier to write and maintain than it sounds, because the Java-code is so C-ish to begin with and the off-the-shelf Java AST package came with an AST-to-Java serializer that was easy to tweak to emit C++. It's nowhere near anything like writing a generic Java-to-C++ translator. It's totally specific to this project.

> < brendan> more emacs or vim and C++ sweat is not leverage

Having the logic of the parser core in an Eclipse-tractable form has been a major win.
(Assignee)

Comment 21

9 years ago
> The stack nodes can't be heap-allocated, because the call stack goes away and
> the tree builder stack stays.

Actually, the stack nodes could be passed by value and go into an array of stack nodes instead of going into an array of pointers to stack nodes if we can convince Hixie to eliminate table taint.

The current design assumes that stack nodes have one bit of writable data. (In the current impl., they don't actually have a writable bit. I took the writable bit out per Hixie's IRC remarks and my own wishes, but then the writable bit stayed in the spec and I've dragged my feet in the hope that I can get Hixie to take it out.)

But anyway, getting rid of table taint would give more degrees of code design freedom here and for speculative parsing, too. Gecko's old parser has table taint but WebKit has gotten away with not having it. I haven't yet seen a real-world page that broke in a user-noticeable way due to lack of table taint.
(Assignee)

Comment 22

9 years ago
Hmm. Or maybe one could live dangerously and have the taint bit out of sync in the copies whose taint bit is never read anyway and still pass the whole stack node by value.
We really should get rid of table taint. If we can show that it simplifies the parser design and that it doesn't appear to break any pages, I don't see why we couldn't get rid of it.

I'd actually suggest that we remove taint asap so that we can get testing on if that causes any problems.
(Assignee)

Comment 24

9 years ago
> The stack nodes can't be heap-allocated, because the call stack goes away and
> the tree builder stack stays.
>
> Attribute values once they get into the attribute holder can't be
> heap-allocated for the same reason.

Argh. s/heap-allocated/stack-allocated/ in both paragraphs.
(Assignee)

Comment 25

9 years ago
Oh, one more thing about stack nodes: Stack nodes need to have object identity so that you can check if a given stack node on the list of active formatting elements is also on the stack. If the stack weren't heap-allocated objects, they'd need to have some kind of serial number or similar ugliness to maintain identity across the two arrays.

(In reply to comment #23)
> I'd actually suggest that we remove taint asap so that we can get testing on if
> that causes any problems.

The HTML5 repo currently doesn't have taint. It matches what WebKit does for foster-parenting text nodes. 

I'll ping Hixie again about taint, although the reason I gave earlier today is a bit shaky and sharing stack nodes across speculations is a more relevant reason.
(Assignee)

Comment 26

9 years ago
Created attachment 382713 [details] [diff] [review]
HTML5 parsing patch, now diffed after the namespace landing
Attachment #381762 - Attachment is obsolete: true
Attachment #382713 - Flags: superreview?(jonas)
Attachment #382713 - Flags: review?(mrbkap)
Attachment #381762 - Flags: superreview?(jonas)
Attachment #381762 - Flags: review?(mrbkap)
(In reply to comment #20)
> (In reply to comment #17)
> The main issue is that the Java code assumes it can return a string as the real
> return value of a method. cjones and I speculated on what would happen if one
> tried to return an nsString by value when you know for sure the backing buffer
> has no other references. Maybe the simplest solution is there.

Sounds doable, we've had nsTAdoptingString or ancient XPIDLString stuff for this kind of buffer hand-off. Could be done in a separate bug for sure.

> As for the appearance of hitting new/delete per tag token or attribute name,
> that code path is taken only for element and attribute names that aren't
> pre-interned. All well-known names are pre-interned.

Right, I saw that pretty quickly and adverted to it over IRC>

I wondered about the use of binary search but didn't look further. Seems like you could perfect-hash your way to O(1) lookup time, but it probably doesn't matter at this point.

> I'm not using const in the generated code at all, because I couldn't get a
> confirmation that either GCC or MS Visual C++ puts const stuff on an actual
> locked memory page. It didn't seem useful to have const correctness only for
> static checking of the generated code without confirmed effects on memory
> segmentation.

Heap allocations are surely not write-protected, but const statics are in a readonly segment. This is where const strings and vtbls go; we confirmed this long ago on the then-current toolchains (2002?).

> Making a guess based on Shark profiling, I guess that there's more to win at
> first by tweaking how exactly notifications get flushed. The time spent in
> frame construction in response to notifications is very significant compared to
> actual parsing. In fact, even the act of appending to the DOM before notifying
> shows up on profiles.

This sounds like a separate, pre-existing bug. On file?

> > Looks like good code, attractive to have in a canonical Java form in the long
> > run: perhaps some day we could translate to managed "JS++" or "C--" to get real
> > memory safety and other properties.
> 
> Translating to "JS++" in the future seems feasible. (The parser core already
> translates to unreadable JS using GWT.)

How does the GWT-generated code perform on modern browsers, have you tested?

> > Specifically, cjones has been using the mozilla C++ namespace to good effect.
> > Why not do the same here?
> 
> I wrote the part that generated the nsHtml5 prefixes before the mozilla C++
> namespace discussion took place. It's possible to change this. Should I pursue
> changing this as a gating factor to landing?

Not if it's inconvenient, but it seems easy enough. Mainly want to build on the precedent, get away from perpetuating "ns". It's too easy to copy it and never get back to cleaning it up. OTOH we could have a flag day and change everything at once. Your call.

>  1) Tryserver Mac builds having utterly bogus symbols that lead me astray when
> there were crashes.

Bug on file?

>  2) Crashes that I could have avoided if I had opted to generate automatic
> refcounting for tree builder stack nodes instead of doing it manually. On
> surface it seemed so simple that I naively decided on doing it manually. (The
> issue of managing this memory would have come up even in a from-scratch C++
> implementation, so I think this can't be blamed on the translation.)

By hand ref-counting is a loser, yeah.

>  3) Various Gecko integration issues that have nothing to do with where the
> tokenizer and abstract tree builder came from.

Seems like grounds for bug filing -- don't mean to nag, do mean to encourage you to file, even if the bugs are somewhat design oriented and vague. Or start newsgroup threads.

As a forcing function for architecture and code cleanup, the new parser may also deliver considerable benefits.

> > < brendan> more emacs or vim and C++ sweat is not leverage
> 
> Having the logic of the parser core in an Eclipse-tractable form has been a
> major win.

I hear this! Outside of this bug, perhaps in mail or a newsgroup, I would be interested to learn more about how you are using Eclipse, what refactorings it helps automate, things like that. Thanks,

/be
@Henri Do you remember your rationale for removing the select/optgroup nodes from these refetests?

https://bugzilla.mozilla.org/attachment.cgi?id=382713&action=diff#a/content/html/document/reftests/bug448564-2_well-formed.html_sec1

Bigger question: what tests should the HTML5 parser pass? A strict superset of the tests the HTML4 parser is subject to? What is that set, exactly? It's larger than just tests/parser/htmlparser, since lots of other tests are affected by parser behavior, e.g. reftests. Is it okay to throw out or weaken existing tests in order to obtain a set of tests that both parsers pass?
(In reply to comment #28)
> Bigger question: what tests should the HTML5 parser pass? A strict superset of
> the tests the HTML4 parser is subject to?

s/strict superset/superset/
(Assignee)

Comment 30

9 years ago
(In reply to comment #27)
> Sounds doable, we've had nsTAdoptingString or ancient XPIDLString stuff for
> this kind of buffer hand-off. Could be done in a separate bug for sure.

Filed bug 497818.

> I wondered about the use of binary search but didn't look further. Seems like
> you could perfect-hash your way to O(1) lookup time, but it probably doesn't
> matter at this point.

Yeah, I wanted to invent a perfect hash function, but then I decided to avoid spending time on it while the exact list of values that it needs to hash isn't set in stone.

> > I'm not using const in the generated code at all, because I couldn't get a
> > confirmation that either GCC or MS Visual C++ puts const stuff on an actual
> > locked memory page. It didn't seem useful to have const correctness only for
> > static checking of the generated code without confirmed effects on memory
> > segmentation.
> 
> Heap allocations are surely not write-protected, but const statics are in a
> readonly segment. This is where const strings and vtbls go; we confirmed this
> long ago on the then-current toolchains (2002?).

What should become const statics? I wanted the named character table and the pre-interned token tables to go into the read-only segment, but when it turned out that making the well-known names truly static wasn't doable because static atoms aren't truly static and the name objects point to atoms, I gave up on both the making the name objects static and making the named character table static.

> > Making a guess based on Shark profiling, I guess that there's more to win at
> > first by tweaking how exactly notifications get flushed. The time spent in
> > frame construction in response to notifications is very significant compared to
> > actual parsing. In fact, even the act of appending to the DOM before notifying
> > shows up on profiles.
> 
> This sounds like a separate, pre-existing bug. On file?

There is bug 482925 on file about tweaking notifications. 

> > > Looks like good code, attractive to have in a canonical Java form in the long
> > > run: perhaps some day we could translate to managed "JS++" or "C--" to get real
> > > memory safety and other properties.
> > 
> > Translating to "JS++" in the future seems feasible. (The parser core already
> > translates to unreadable JS using GWT.)
> 
> How does the GWT-generated code perform on modern browsers, have you tested?

Having TraceMonkey on or off doesn't matter for parsing the SVG Tiger into a native rendered DOM, IIRC. So either the code doesn't JIT well or it's the stuff on the other side of the JS/native boundary that is more significant.

> Not if it's inconvenient, but it seems easy enough. Mainly want to build on the
> precedent, get away from perpetuating "ns". It's too easy to copy it and never
> get back to cleaning it up. OTOH we could have a flag day and change everything
> at once. Your call.

Filed by bug 497820 to remind me. However, I think I'll pursue addressing mochitest and Tp issues first.

> >  1) Tryserver Mac builds having utterly bogus symbols that lead me astray when
> > there were crashes.
> 
> Bug on file?

Yes, bug 495030.

> >  3) Various Gecko integration issues that have nothing to do with where the
> > tokenizer and abstract tree builder came from.
> 
> Seems like grounds for bug filing -- don't mean to nag, do mean to encourage
> you to file, even if the bugs are somewhat design oriented and vague. Or start
> newsgroup threads.
...
> I hear this! Outside of this bug, perhaps in mail or a newsgroup, I would be
> interested to learn more about how you are using Eclipse, what refactorings it
> helps automate, things like that. Thanks,

I'll try to find something useful to say on these points in the newsgroups/other bugs.

(In reply to comment #28)
> @Henri Do you remember your rationale for removing the select/optgroup nodes
> from these refetests?
> 
> https://bugzilla.mozilla.org/attachment.cgi?id=382713&action=diff#a/content/html/document/reftests/bug448564-2_well-formed.html_sec1

Per HTML5, a lone optgroup does not make the parser infer a select. The HTML5 behavior matches WebKit, Opera and IE. The old Gecko parser is alone with the select inferring behavior.

> Bigger question: what tests should the HTML5 parser pass? A strict superset of
> the tests the HTML4 parser is subject to? What is that set, exactly? It's
> larger than just tests/parser/htmlparser, since lots of other tests are
> affected by parser behavior, e.g. reftests. Is it okay to throw out or weaken
> existing tests in order to obtain a set of tests that both parsers pass?

My understanding of the criteria for landing it preffed off was that it didn't need to pass particular tests as long as it doesn't bother the trunk passing tests with the old parser active.

As for turning the pref on, I'd expect the parser to have to pass the tree builder tests of the html5lib suite, the Java version to pass the html5lib tokenizer suite (probably not worthwhile to build a C++ harness for that) and the browser as a whole to pass existing reftests and mochitests except ones that are identified as contradicting HTML5 and where the HTML5 behavior is preferred. (This is just my expectation.)
Flags: wanted1.9.2?
Flags: blocking1.9.2?
> static atoms aren't truly static and the name objects point to atoms

A trick we use in some places is to store an nsIAtom** in the constant data, see e.g. the callers to FindAttrValueIn.
> Having TraceMonkey on or off doesn't matter for parsing the SVG Tiger

If you can create a standalone testcase for this, file a bug, please?
(Assignee)

Comment 33

9 years ago
(In reply to comment #31)
> > static atoms aren't truly static and the name objects point to atoms
> 
> A trick we use in some places is to store an nsIAtom** in the constant data,
> see e.g. the callers to FindAttrValueIn.

I went with the current setup after this thread:
http://groups.google.fi/group/mozilla.dev.platform/browse_thread/thread/e323c6ea1755b7e5/c5a7099f3aaa6893

(In reply to comment #32)
> > Having TraceMonkey on or off doesn't matter for parsing the SVG Tiger
> 
> If you can create a standalone testcase for this, file a bug, please?

OK, I'll do that.

- -

About Tp numbers: If Tp didn't somehow time out, local testing suggests the HTML5 parser (currently without the contemplated goto optimizations) should give a 3% improvement to Tp (4% improvement to min times and 1% improvement to max times).
(Assignee)

Updated

9 years ago
Depends on: 497580
(Assignee)

Updated

9 years ago
No longer depends on: 497580
(Assignee)

Updated

9 years ago
Depends on: 496712
(Assignee)

Comment 34

9 years ago
I'm only seeing small differences between the perf of Firefox 3.5b99, HTML5 builds with HTML5 active and with HTML5 builds with HTML5 inactive.

However, on tryserver, builds with no HTML5 parser at all are green but even HTML5 builds that have the HTML5 parser inactive are insanely slow when running the old parser.

This seems like a blocking issue. (I feel terrible for not trying this configuration on the tryserver before. I had only tried HTML5 on and same point on trunk without any HTML5 code in the build at all.)

There's something about the latent presence of the HTML5 parser that slows things down on the tryserver even when the HTML5 parser isn't run (only its pseudo-static memory is initialized at app startup). The most reasonable explanation I can come up with is that the different set of static atoms makes the atom table slow somehow. But if that were the explanation, I should be seeing the same slowdown locally, which I don't.

(I'm in the process of verifying with printfs that the latent HTML5 parser really doesn't get run and doesn't get its statics initialized more than once.)
Have you tried running the tryserver-generated build locally?
(Assignee)

Comment 36

9 years ago
(In reply to comment #34)
> However, on tryserver, builds with no HTML5 parser at all are green but even
> HTML5 builds that have the HTML5 parser inactive are insanely slow when running
> the old parser.

Fortunately, this was a misdiagnosis my part. The last orange Tp run had HTML5 parsing enabled. Tp is green with latent HTML5 parsing. I'll now keep the HTML5 repo in the preffed-off state in order to simulate the situation of having the parser landed on trunk preffed off.

However, with latent HTML5 parser mochitest times out. Now investigating that instead.

I have verified that the HTML5 parser is indeed latent on the tryserver in those builds and only participates in pseudo-static initialization and shutdown and well as in the global atom table load factor.

(In reply to comment #35)
> Have you tried running the tryserver-generated build locally?

Yes, I have. (Both HTML5 enabled and latent.) The numbers are similar to locally-built builds.

Since my dev machine has more RAM than tryserver talos, I'll try running on a RAM-limited Linux VM image next.
(Assignee)

Comment 37

9 years ago
Created attachment 383652 [details] [diff] [review]
Hopefully a landable patch (except still doesn't have .java sources)
Attachment #382713 - Attachment is obsolete: true
Attachment #383652 - Flags: superreview?(jonas)
Attachment #383652 - Flags: review?(mrbkap)
Attachment #382713 - Flags: superreview?(jonas)
Attachment #382713 - Flags: review?(mrbkap)
(Assignee)

Comment 38

9 years ago
Created attachment 383653 [details] [diff] [review]
Interdiff of the previous and current patch
(Assignee)

Comment 39

9 years ago
(In reply to comment #28)
> Is it okay to throw out or weaken existing tests in order to 
> obtain a set of tests that both parsers pass?

I don't know really. What should I do with the tests for bug 448564? (Indeed these tests went orange with the latest patch with the HTML5 parser turned off.)

It seems clear that when the HTML5 is turned on, the <optgroup> testing there will be obsolete, but until then, how do I drive the test failures of the HTML5 parser down without weakening tests for the old parser as long as it's on-by-default?

(I'm assuming that it's uncontroversial that HTML5 doesn't imply <select> on <optgroup> since that behavior is a Gecko-specific oddity.)
I think we should just weaken the tests for the old parser. Trying to do something better is going to be overcomplicated and a waste of time.
(Assignee)

Comment 41

9 years ago
In order to proceed on the issue of making the HTML5 parsing-enabled app simple to distribute under the GPL, I suggest that we start by including only the .java files that are strictly necessary to provide the preferred form for making modifications to the generated C++ but placing them into such a directory structure that it's super-simple to just copy the entire Java source tree on top of the files and have the same paths work. This way we can also check in more of the Java source tree later without having to make a decision about the rest of it now.

The structure of the Java source tree is htmlparser/src/nu/validator/htmlparser/impl/*.java. This needs to be rooted somewhere. I suggest rooting it near the C++ files which would yield
content/html/parser/htmlparser/src/nu/validator/htmlparser/impl/*.java
(That's ugly, but the alternatives I can think of are all worse somehow.)

Is this OK? Can all platforms that need to host the source tree handle paths of that length?

According to licensing at mozilla.org, the translator doesn't have to accompany the source from licensing POV, so this bug doesn't need to block on checking in the translator. I opened bug 499141 about checking it in (assuming we want it in).

- -

Removing bug 486123 and bug 495407 from deps, because Tp is green when the HTML5 parser is preffed off.
No longer depends on: 486123, 495407
First off, why does the new parser live in /content rather than in /parser?

And why not just place the javafiles in something like

<path for C++ code>/javasrc/*.java

What's the advantage of having the full java path? (please keep in mind i'm not a java guy)
Comments so far:

>diff --git a/content/base/public/nsINode.h b/content/base/public/nsINode.h
>--- a/content/base/public/nsINode.h
>+++ b/content/base/public/nsINode.h
>@@ -431,20 +431,23 @@ public:
>   /**
>    * Remove a child from this node.  This method handles calling UnbindFromTree
>    * on the child appropriately.
>    *
>    * @param aIndex the index of the child to remove
>    * @param aNotify whether to notify the document (current document for
>    *        nsIContent, and |this| for nsIDocument) that the remove has
>    *        occurred
>+   * @param aMutationEvent whether to fire a mutation event
>    *
>    * Note: If there is no child at aIndex, this method will simply do nothing.
>    */
>-  virtual nsresult RemoveChildAt(PRUint32 aIndex, PRBool aNotify) = 0;
>+  virtual nsresult RemoveChildAt(PRUint32 aIndex, 
>+                                 PRBool aNotify, 
>+                                 PRBool aMutationEvent = PR_TRUE) = 0;

Urg, I'd really prefer not to have default values on interfaces. Why do we need this second bool at all? What we do elsewhere is to do the mutation with aNotify=false, and then manually notify. (Though I think for removals we'd manually notify first, and then do the mutation).

>diff --git a/content/base/src/nsContentUtils.cpp b/content/base/src/nsContentUtils.cpp
>@@ -3579,16 +3580,68 @@ nsContentUtils::CreateContextualFragment
>   nsresult rv;
>   nsCOMPtr<nsINode> node = do_QueryInterface(aContextNode);
>   NS_ENSURE_TRUE(node, NS_ERROR_NOT_AVAILABLE);
> 
>   // If we don't have a document here, we can't get the right security context
>   // for compiling event handlers... so just bail out.
>   nsCOMPtr<nsIDocument> document = node->GetOwnerDoc();
>   NS_ENSURE_TRUE(document, NS_ERROR_NOT_AVAILABLE);
>+  
>+  PRBool bCaseSensitive = document->IsCaseSensitive();
>+
>+  nsCOMPtr<nsIHTMLDocument> htmlDoc(do_QueryInterface(document));
>+  PRBool bHTML = htmlDoc && !bCaseSensitive;
>+
>+  if (bHTML && nsContentUtils::GetBoolPref("html5.enable", PR_FALSE)) {

You should use the nsContentUtils::AddBoolPrefVarCache to avoid calling into the pref service every time. It's a really simple API to use.

>+    // See if the document has a cached fragment parser. nsHTMLDocument is the
>+    // only one that should really have one at the moment.
>+    nsCOMPtr<nsIParser> parser = document->GetFragmentParser();
>+    if (parser) {
>+      // Get the parser ready to use.
>+      parser->Reset();
>+    }
>+    else {
>+      // Create a new parser for this operation.
>+      parser = nsHtml5Module::NewHtml5Parser();
>+      if (!parser) {
>+        return NS_ERROR_OUT_OF_MEMORY;
>+      }
>+      document->SetFragmentParser(parser);
>+    }
>+    nsCOMPtr<nsIDOMDocumentFragment> frag;
>+    rv = NS_NewDocumentFragment(getter_AddRefs(frag), document->NodeInfoManager());
>+    NS_ENSURE_SUCCESS(rv, rv);
>+    
>+    nsCOMPtr<nsIContent> contextAsContent = do_QueryInterface(aContextNode);
>+    if (contextAsContent && !contextAsContent->IsNodeOfType(nsINode::eELEMENT)) {
>+      contextAsContent = contextAsContent->GetParent();
>+      if (contextAsContent && !contextAsContent->IsNodeOfType(nsINode::eELEMENT)) {
>+        // can this even happen?
>+        contextAsContent = nsnull;

I'd imagine people can pass all sorts of nodes to this function as I think we expose it to content directly (not just through innerHTML). I believe it lives on nsIDOMNSRange.

>diff --git a/content/html/document/reftests/bug448564-2_well-formed.html
>diff --git a/content/html/document/reftests/bug448564-3_well-formed.html 
>diff --git a/content/html/document/reftests/bug448564-5_well-formed.html 

Need to take a look at these more in detail. Once we enable the parser by default, might be good to make these tests test that things are how they should be in the new world, rather than making the tests weaker.


See also comment 42, I think we'll want the new parser files to live in parser/html5 or some such. We can and should still link it with the content code though.
(In reply to comment #42)
> First off, why does the new parser live in /content rather than in /parser?
> 
> And why not just place the javafiles in something like
> 
> <path for C++ code>/javasrc/*.java

+1 on the above, and stronger on favoring parser/html5parser or whatever the name should be. Move the current htmlparser to a to-be-removed name if you can, so we can retire it and give the most straightforward and version-independent name to the new parser.

/be
Just parser/html works, leaving parser/htmlparser as the overlong name to retire.

/be
In general we need to do something about these enormous lists. At the very least, I think we should use macro-tricks a'la our atom lists to avoid having to repeat the same list multiple times. This will also keep the actual code in files that are readable.

That way we could even remove it from the patch and keep the patch readable :)

>diff --git a/content/html/document/src/nsHTMLContentSink.cpp b/content/html/document/src/nsHTMLContentSink.cpp
>@@ -649,25 +650,35 @@ nsresult
> nsHTMLDocument::StartDocumentLoad(const char* aCommand,
>                                   nsIChannel* aChannel,
>                                   nsILoadGroup* aLoadGroup,
>                                   nsISupports* aContainer,
>                                   nsIStreamListener **aDocListener,
>                                   PRBool aReset,
>                                   nsIContentSink* aSink)
> {
>+  PRBool loadAsHtml5 = nsContentUtils::GetBoolPref("html5.enable", PR_FALSE);
>+  if (aSink) {
>+    loadAsHtml5 = PR_FALSE;
>+  }
>+
>   nsCAutoString contentType;
>   aChannel->GetContentType(contentType);
> 
>   if (contentType.Equals("application/xhtml+xml") &&
>       (!aCommand || nsCRT::strcmp(aCommand, "view-source") != 0)) {
>     // We're parsing XHTML as XML, remember that.
> 
>     mIsRegularHTML = PR_FALSE;
>     mCompatMode = eCompatibility_FullStandards;
>+    loadAsHtml5 = PR_FALSE;
>+  }
>+  
>+  if (!(contentType.Equals("text/html") && aCommand && !nsCRT::strcmp(aCommand, "view"))) {
>+    loadAsHtml5 = PR_FALSE;

Why don't we want documents loaded with other commands to be loaded using the HTML 5 parser? Granted, I'm not sure if we ever use the html parser for non-view documents given that DOMParser and XMLHttpRequest only does XML documents currently.

>@@ -1910,16 +1933,66 @@ nsHTMLDocument::OpenCommon(const nsACStr
>     SetIsInitialDocument(PR_FALSE);
> 
>     nsCOMPtr<nsIScriptGlobalObject> newScope(do_QueryReferent(mScopeObject));
>     if (oldScope && newScope != oldScope) {
>       nsContentUtils::ReparentContentWrappersInScope(oldScope, newScope);
>     }
>   }
> 
>+  if (loadAsHtml5) {
>+    // Really zap all children (copied from nsDocument.cpp -- maybe factor into a method)
>+    PRUint32 count = mChildren.ChildCount();
>+    { // Scope for update
>+      MOZ_AUTO_DOC_UPDATE(this, UPDATE_CONTENT_MODEL, PR_TRUE);    
>+      for (PRInt32 i = PRInt32(count) - 1; i >= 0; i--) {
>+        nsCOMPtr<nsIContent> content = mChildren.ChildAt(i);
>+        // XXXbz this is backwards from how ContentRemoved normally works.  That
>+        // is, usually it's dispatched after the content has been removed from
>+        // the tree.
>+        nsNodeUtils::ContentRemoved(this, content, i);
>+        content->UnbindFromTree();
>+        mChildren.RemoveChildAt(i);
>+      }
>+    }
>+  }
>+
>+  // XXX This is a nasty workaround for a scrollbar code bug
>+  // (http://bugzilla.mozilla.org/show_bug.cgi?id=55334).
>+
>+  // Hold on to our root element
>+  nsCOMPtr<nsIContent> root = GetRootContent();
>+
>+  if (root) {
>+    PRInt32 rootIndex = mChildren.IndexOfChild(root);
>+    NS_ASSERTION(rootIndex >= 0, "Root must be in list!");
>+    
>+    PRUint32 count = root->GetChildCount();
>+
>+    // Remove all the children from the root.
>+    while (count-- > 0) {
>+      root->RemoveChildAt(count, PR_TRUE);
>+    }
>+
>+    count = root->GetAttrCount();
>+
>+    // Remove all attributes from the root element
>+    while (count-- > 0) {
>+      const nsAttrName* name = root->GetAttrNameAt(count);
>+      // Hold a strong reference here so that the atom doesn't go away during
>+      // UnsetAttr.
>+      nsCOMPtr<nsIAtom> localName = name->LocalName();
>+      root->UnsetAttr(name->NamespaceID(), localName, PR_FALSE);
>+    }
>+
>+    // Remove the root from the childlist
>+    mChildren.RemoveChildAt(rootIndex);
>+    mCachedRootContent = nsnull;
>+  }
>+
>   // Call Reset(), this will now do the full reset
>   Reset(channel, group);
>   if (baseURI) {
>     mDocumentBaseURI = baseURI;
>   }
> 
>   if (IsEditingOn()) {
>     // Reset() blows away all event listeners in the document, and our

This hack is removed from trunk, so we don't want it back. I think you can remove all of this now that Reset properly resets things.

>diff --git a/content/html/parser/src/jArray.h b/content/html/parser/src/jArray.h
>+template<class T, class L> 
>+class jArray {
>+  private:
>+    T* arr;
>+  public:
>+    L length;
>+    jArray(T* const a, L const len);
>+    jArray(L const len);
>+    jArray(const jArray<T,L>& other);
>+    jArray();
>+    operator T*() { return arr; }
>+    T& operator[] (L const index) { return arr[index]; }
>+    void release() { delete[] arr; }
>+    L binarySearch(T const elem);
>+};

Why can't you just use nsTArray? Or even a simple C-array?

>diff --git a/content/html/parser/src/nsHtml5ArrayCopy.h b/content/html/parser/src/nsHtml5ArrayCopy.h

>+class nsHtml5ArrayCopy {
>+  public: 
>+    static
>+    inline
>+    void
>+    arraycopy(PRUnichar* source, PRInt32 sourceOffset, PRUnichar* target, PRInt32 targetOffset, PRInt32 length)
>+    {
>+      memcpy(&(target[targetOffset]), &(source[sourceOffset]), length * sizeof(PRUnichar));
>+    }
>+    
>+    static
>+    inline
>+    void
>+    arraycopy(PRUnichar* source, PRUnichar* target, PRInt32 length)
>+    {
>+      memcpy(target, source, length * sizeof(PRUnichar));
>+    }
>+    
>+    static
>+    inline
>+    void
>+    arraycopy(nsString** source, nsString** target, PRInt32 length)
>+    {
>+      memcpy(target, source, length * sizeof(nsString*));
>+    }
>+    
>+    static
>+    inline
>+    void
>+    arraycopy(nsHtml5AttributeName** source, nsHtml5AttributeName** target, PRInt32 length)
>+    {
>+      memcpy(target, source, length * sizeof(nsHtml5AttributeName*));
>+    }
>+    
>+    static
>+    inline
>+    void
>+    arraycopy(nsHtml5StackNode** source, nsHtml5StackNode** target, PRInt32 length)
>+    {
>+      memcpy(target, source, length * sizeof(nsHtml5StackNode*));
>+    }
>+    
>+    static
>+    inline
>+    void
>+    arraycopy(nsHtml5StackNode** arr, PRInt32 sourceOffset, PRInt32 targetOffset, PRInt32 length)
>+    {
>+      memmove(&(arr[targetOffset]), &(arr[sourceOffset]), length * sizeof(nsHtml5StackNode*));
>+    }

Hmm.. I'm not really a big fan of this. Can't you just call memmove directly? I suspect if you use nsTArray you won't need a lot of these.

>diff --git a/content/html/parser/src/nsHtml5AtomList.h b/content/html/parser/src/nsHtml5AtomList.h

We might want to put these into nsGkAtomsList.h, that'll work fine as long as we link the parser into gklayout. But it's fine to keep separately for now.

>diff --git a/content/html/parser/src/nsHtml5AttributeName.cpp b/content/html/parser/src/nsHtml5AttributeName.cpp
>+nsHtml5AttributeName::bufToHash(PRUnichar* buf, PRInt32 len)
>+{
>+  PRInt32 hash2 = 0;
>+  PRInt32 hash = len;
>+  hash <<= 5;
>+  hash += buf[0] - 0x60;
>+  PRInt32 j = len;
>+  for (PRInt32 i = 0; i < 4 && j > 0; i++) {
>+    j--;
>+    hash <<= 5;
>+    hash += buf[j] - 0x60;
>+    hash2 <<= 6;
>+    hash2 += buf[i] - 0x5F;
>+  }
>+  return hash ^ hash2;
>+}

I take it this is a generated perfect-hash function?
There's a number of cases like this:

static jArray<PRUnichar,PRInt32> SCRIPT_ARR;

Unfortunately compilers aren't reliable enough at initializing static objects, so this won't work cross all platforms. Static variables that don't require constructors (such as a pointer to an array) is ok though.
(In reply to comment #46)
> In general we need to do something about these enormous lists. At the very
> least, I think we should use macro-tricks a'la our atom lists to avoid having
> to repeat the same list multiple times. This will also keep the actual code in
> files that are readable.

Could use a map macro, e.g.

#define LIST(_) \
  _(tag1)       \
  ...           \
  _(tagN)

where tag1 through tagN are written out as literals, and then you call LIST with different macro names for the _ parameter. Or (better if you want to write down severa facts about each tag), use something js/src/jsopcode.tbl and its #include consumers, who define custom OPDEF macros.

> >+class nsHtml5ArrayCopy {
> >+  public: 
> >+    static
> >+    inline
> >+    void

These three (storage class, qualifier(s) and type) should go on one line, at least in typical C++ style around here.

> >+nsHtml5AttributeName::bufToHash(PRUnichar* buf, PRInt32 len)
> >+{
> >+  PRInt32 hash2 = 0;
> >+  PRInt32 hash = len;
> >+  hash <<= 5;
> >+  hash += buf[0] - 0x60;
> >+  PRInt32 j = len;
> >+  for (PRInt32 i = 0; i < 4 && j > 0; i++) {
> >+    j--;
> >+    hash <<= 5;
> >+    hash += buf[j] - 0x60;
> >+    hash2 <<= 6;
> >+    hash2 += buf[i] - 0x5F;
> >+  }
> >+  return hash ^ hash2;
> >+}
> 
> I take it this is a generated perfect-hash function?

Doesn't look like it. GNU perfect (non-minimal) hash function generator:

http://www.gnu.org/software/gperf/

/be
(Assignee)

Comment 49

9 years ago
(In reply to comment #42)
> First off, why does the new parser live in /content rather than in /parser?

Because /content and /parser are in different shared libraries and changing things so that they link into the same shared library and so that parser stuff sees headers that are supposed to be /content-private were outside my understanding of the build system. When it became obvious that no one else would make the change for me in a timely fashion and that trying to make change myself while totally lacking clue about the build system would have been a time sink, I moved the parser under /content to get going.

> And why not just place the javafiles in something like
> 
> <path for C++ code>/javasrc/*.java
> 
> What's the advantage of having the full java path? (please keep in mind i'm not
> a java guy)

The advantage of having the full Java path is that you can copy the whole Java parser source distribution over the directories and have the same directory structure for the key files that are used for translator input. As for why the Java parser source distribution has those horrible telescoping directories, that's the Java convention and not following the convention would be even worse than following the convention, since tools assume the convention.

<path for C++ code>/javasrc/*.java works if we just want to make sure the .java files get delivered and don't want to make the delivery method Java IDE friendly. (For IDE-friendliness, you'd need more files than the ones that are used for translation input anyway.)

(In reply to comment #43)
> Comments so far:
> 
> >diff --git a/content/base/public/nsINode.h b/content/base/public/nsINode.h
> >--- a/content/base/public/nsINode.h
> >+++ b/content/base/public/nsINode.h
> >@@ -431,20 +431,23 @@ public:
> >   /**
> >    * Remove a child from this node.  This method handles calling UnbindFromTree
> >    * on the child appropriately.
> >    *
> >    * @param aIndex the index of the child to remove
> >    * @param aNotify whether to notify the document (current document for
> >    *        nsIContent, and |this| for nsIDocument) that the remove has
> >    *        occurred
> >+   * @param aMutationEvent whether to fire a mutation event
> >    *
> >    * Note: If there is no child at aIndex, this method will simply do nothing.
> >    */
> >-  virtual nsresult RemoveChildAt(PRUint32 aIndex, PRBool aNotify) = 0;
> >+  virtual nsresult RemoveChildAt(PRUint32 aIndex, 
> >+                                 PRBool aNotify, 
> >+                                 PRBool aMutationEvent = PR_TRUE) = 0;
> 
> Urg, I'd really prefer not to have default values on interfaces. Why do we need
> this second bool at all?

Because the mutation event and notification code is intertwined into the RemoveChildAt implementation and getting it out of there would be more complex. I added another boolean per bug 484988 comment 1. I added the default on my own initiative in order to minimize the changes to callers.

> What we do elsewhere is to do the mutation with aNotify=false, and then manually notify.

I tried doing that first, but the adoption agency algorithm caused double frame construction.

> (Though I think for removals we'd
> manually notify first, and then do the mutation).

I tried both ways. Didn't work right with the frame constructor.

> >diff --git a/content/base/src/nsContentUtils.cpp b/content/base/src/nsContentUtils.cpp
> >@@ -3579,16 +3580,68 @@ nsContentUtils::CreateContextualFragment
> >   nsresult rv;
> >   nsCOMPtr<nsINode> node = do_QueryInterface(aContextNode);
> >   NS_ENSURE_TRUE(node, NS_ERROR_NOT_AVAILABLE);
> > 
> >   // If we don't have a document here, we can't get the right security context
> >   // for compiling event handlers... so just bail out.
> >   nsCOMPtr<nsIDocument> document = node->GetOwnerDoc();
> >   NS_ENSURE_TRUE(document, NS_ERROR_NOT_AVAILABLE);
> >+  
> >+  PRBool bCaseSensitive = document->IsCaseSensitive();
> >+
> >+  nsCOMPtr<nsIHTMLDocument> htmlDoc(do_QueryInterface(document));
> >+  PRBool bHTML = htmlDoc && !bCaseSensitive;
> >+
> >+  if (bHTML && nsContentUtils::GetBoolPref("html5.enable", PR_FALSE)) {
> 
> You should use the nsContentUtils::AddBoolPrefVarCache to avoid calling into
> the pref service every time. It's a really simple API to use.

OK.

> >+    // See if the document has a cached fragment parser. nsHTMLDocument is the
> >+    // only one that should really have one at the moment.
> >+    nsCOMPtr<nsIParser> parser = document->GetFragmentParser();
> >+    if (parser) {
> >+      // Get the parser ready to use.
> >+      parser->Reset();
> >+    }
> >+    else {
> >+      // Create a new parser for this operation.
> >+      parser = nsHtml5Module::NewHtml5Parser();
> >+      if (!parser) {
> >+        return NS_ERROR_OUT_OF_MEMORY;
> >+      }
> >+      document->SetFragmentParser(parser);
> >+    }
> >+    nsCOMPtr<nsIDOMDocumentFragment> frag;
> >+    rv = NS_NewDocumentFragment(getter_AddRefs(frag), document->NodeInfoManager());
> >+    NS_ENSURE_SUCCESS(rv, rv);
> >+    
> >+    nsCOMPtr<nsIContent> contextAsContent = do_QueryInterface(aContextNode);
> >+    if (contextAsContent && !contextAsContent->IsNodeOfType(nsINode::eELEMENT)) {
> >+      contextAsContent = contextAsContent->GetParent();
> >+      if (contextAsContent && !contextAsContent->IsNodeOfType(nsINode::eELEMENT)) {
> >+        // can this even happen?
> >+        contextAsContent = nsnull;
> 
> I'd imagine people can pass all sorts of nodes to this function as I think we
> expose it to content directly (not just through innerHTML). I believe it lives
> on nsIDOMNSRange.

Doesn't that code offer similar level of coverage for weird cases as the old code? Or did you mean the answer to "can this ever happen?" is "yes" and the comment should just go away?

> >diff --git a/content/html/document/reftests/bug448564-2_well-formed.html
> >diff --git a/content/html/document/reftests/bug448564-3_well-formed.html 
> >diff --git a/content/html/document/reftests/bug448564-5_well-formed.html 
> 
> Need to take a look at these more in detail. Once we enable the parser by
> default, might be good to make these tests test that things are how they should
> be in the new world, rather than making the tests weaker.

Should I revert the test case changes?

> See also comment 42, I think we'll want the new parser files to live in
> parser/html5 or some such. We can and should still link it with the content
> code though.

I think I need help to make that happen.

(In reply to comment #45)
> Just parser/html works, leaving parser/htmlparser as the overlong name to
> retire.

I think parser/html makes more sense than parser/html5 given that the parser is meant for all of text/html.

(In reply to comment #46)
> In general we need to do something about these enormous lists. At the very
> least, I think we should use macro-tricks a'la our atom lists to avoid having
> to repeat the same list multiple times. This will also keep the actual code in
> files that are readable.

What's repeated multiple times?

This comes back to the question of keeping the translator super-simple vs. making the translator more complex in order to make the output nicer. I've tried to keep the translator super-simple so far to avoid sinking a lot of time into the translator.

Is this a blocker for landing preffed off?

> Why don't we want documents loaded with other commands to be loaded using the
> HTML 5 parser? Granted, I'm not sure if we ever use the html parser for
> non-view documents given that DOMParser and XMLHttpRequest only does XML
> documents currently.

View source, feed content sanitizing, clipboard paste and bookmark import with the new parser aren't done yet. This is just to get the testing clock started with the Web-exposed functionality.

> >   if (IsEditingOn()) {
> >     // Reset() blows away all event listeners in the document, and our
> 
> This hack is removed from trunk, so we don't want it back. I think you can
> remove all of this now that Reset properly resets things.

This has to be a bad automatic merge by hg that I didn't review properly. Thanks!

> >diff --git a/content/html/parser/src/jArray.h b/content/html/parser/src/jArray.h
> >+template<class T, class L> 
> >+class jArray {
> >+  private:
> >+    T* arr;
> >+  public:
> >+    L length;
> >+    jArray(T* const a, L const len);
> >+    jArray(L const len);
> >+    jArray(const jArray<T,L>& other);
> >+    jArray();
> >+    operator T*() { return arr; }
> >+    T& operator[] (L const index) { return arr[index]; }
> >+    void release() { delete[] arr; }
> >+    L binarySearch(T const elem);
> >+};
> 
> Why can't you just use nsTArray?

These arrays get returned by methods. Returning by value yields the right semantics, because Java arrays have immutable length. Perhaps the right length passing semantics with the same code structure could be achieved by passing nsTArrays by value if they were never resized, but wouldn't it be more confusing to use a template designed for resizable arrays while making sure that they don't actually get resized in ways that break passing the length by value?

The second argument to the jArray template is ugly. Up-front, I wanted the translator to be extensible to target non-Gecko environments and I wanted to make the hand-written support template not to need editing in order to work in a non-Gecko environment. By now, it's obvious that non-generated support files need hand-tweaking in order to support non-NSPR types, so I guess the second template argument should just go away and PRInt32 be hard-coded into the template file.

> Or even a simple C-array?

When the code doesn't actually care about the length or the array, the translator generates simple C arrays. (The Java arrays whose length doesn't matter are annotated with @NoLength for this purpose.)

> >+class nsHtml5ArrayCopy {
> Hmm.. I'm not really a big fan of this. Can't you just call memmove directly? I
> suspect if you use nsTArray you won't need a lot of these.

These are needed, because the translator is a super-simple syntactic transform that doesn't even have a symbol table. If the translator knew the types of the variables it deals with, I could make it generate the right memmoves directly. The nsHtml5ArrayCopy stuff leaves the type-based dispatch to the C++ compiler.

I guess there's now enough demand for the kind of transformations that aren't simple type-unaware syntactic rewriting that it would make sense for me to spend time to make the translator type-aware to enable more advanced transformations.

> >diff --git a/content/html/parser/src/nsHtml5AtomList.h b/content/html/parser/src/nsHtml5AtomList.h
> 
> We might want to put these into nsGkAtomsList.h, that'll work fine as long as
> we link the parser into gklayout. But it's fine to keep separately for now.

I had them in nsGkAtomList initially until I was advised I could have a separate atom list. I don't remember the details of that discussion.

> >diff --git a/content/html/parser/src/nsHtml5AttributeName.cpp b/content/html/parser/src/nsHtml5AttributeName.cpp
> >+nsHtml5AttributeName::bufToHash(PRUnichar* buf, PRInt32 len)
> >+{
> >+  PRInt32 hash2 = 0;
> >+  PRInt32 hash = len;
> >+  hash <<= 5;
> >+  hash += buf[0] - 0x60;
> >+  PRInt32 j = len;
> >+  for (PRInt32 i = 0; i < 4 && j > 0; i++) {
> >+    j--;
> >+    hash <<= 5;
> >+    hash += buf[j] - 0x60;
> >+    hash2 <<= 6;
> >+    hash2 += buf[i] - 0x5F;
> >+  }
> >+  return hash ^ hash2;
> >+}
> 
> I take it this is a generated perfect-hash function?

It's manually written to have no collisions with the well-known names.

(In reply to comment #47)
> There's a number of cases like this:
> 
> static jArray<PRUnichar,PRInt32> SCRIPT_ARR;
> 
> Unfortunately compilers aren't reliable enough at initializing static objects,
> so this won't work cross all platforms. Static variables that don't require
> constructors (such as a pointer to an array) is ok though.

Can the declaration stay as-is with the real initialization moving to initializeStatics()?

(In reply to comment #48)
> > >+class nsHtml5ArrayCopy {
> > >+  public: 
> > >+    static
> > >+    inline
> > >+    void
> 
> These three (storage class, qualifier(s) and type) should go on one line, at
> least in typical C++ style around here.

Oops. I'll fix.

- -

I thought some more about the issue of hitting new/delete for stack nodes. What I said earlier was pretty badly bogus. It should be possible to introduce different node types for the stack and the list of formatting elements and have both the stack and the list be arrays of those nodes rather than pointers to nodes. The node identity can be the object identity of the wrapped element node. I *think* this would work with or without taint.

I'd like to start the Web compat testing clock (i.e. land the parser land preffed off) before transforming the tree builder like this, though.
(Assignee)

Comment 50

9 years ago
Created attachment 384287 [details] [diff] [review]
Fix merge issue, cache pref, fix static inline void and cycle collection
Attachment #383652 - Attachment is obsolete: true
Attachment #384287 - Flags: superreview?(jonas)
Attachment #384287 - Flags: review?(mrbkap)
Attachment #383652 - Flags: superreview?(jonas)
Attachment #383652 - Flags: review?(mrbkap)
(Assignee)

Comment 51

9 years ago
Created attachment 384288 [details] [diff] [review]
Interdiff of the previous and current patch
(Assignee)

Comment 52

9 years ago
The remaining test failures with the current patch are:
 * One nsComponentManagerImpl leaks on Windows
 * mochitest-chrome times out on Linux
Is the nsComponentManagerImpl leak the leak that's causing cycles collector slowness and ultimately talos timing out?
(Assignee)

Comment 54

9 years ago
The nsComponentManagerImpl leak is with HTML5 parser preffed off, so it should be unrelated to the cycle collector code in the HTML5 parser.

I merged with a known-green point from the trunk after the last comment and now mochitest-plain timed out (with the old parser). Any advice on how to troubleshoot?
> The nsComponentManagerImpl leak is with HTML5 parser preffed off,

Is this one of the known random oranges?  Did it appear on all try machines for you, or just some?

> * mochitest-chrome times out on Linux

Similar questions...
(In reply to comment #50)
> Created an attachment (id=384287) [details]
> Fix merge issue, cache pref, fix static inline void and cycle collection

I'm guessing you made the parser a cycle collection participant to help track down the leak.  Is that a permanent design choice?  I ask because the cycle collector is main thread only, which means cycle collection could complicate off-main thread parsing.
(Assignee)

Comment 57

9 years ago
(In reply to comment #55)
> > The nsComponentManagerImpl leak is with HTML5 parser preffed off,
> 
> Is this one of the known random oranges? 

If it is, I didn't find it. Bugs that I checked about leaks in the nsComponentManagerImpl area seem ancient and don't mention random orange.

> Did it appear on all try machines for you, or just some?

Only on Windows.

The timeout on Linux turned out to be known random.

I'm not sure yet if the orange on Mac was known random.

(In reply to comment #56)
> I'm guessing you made the parser a cycle collection participant to help track
> down the leak.  Is that a permanent design choice?  I ask because the cycle
> collector is main thread only, which means cycle collection could complicate
> off-main thread parsing.

It's not a permanent design choice. In fact, I had tried to avoid implementing more temporary measures that would get thrown away with the off-the-main-thread move. However, evidence of the HTML5 Tp problems points to a leak, so I figured the cycle collector might help. It didn't.
Does your nsComponentManagerImpl leak look like this, from my nothing-to-do-with-HTML-parser build?
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245727467.1245734447.8230.gz
(Assignee)

Comment 59

9 years ago
(In reply to comment #58)
> Does your nsComponentManagerImpl leak look like this, from my
> nothing-to-do-with-HTML-parser build?
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245727467.1245734447.8230.gz

It looks just like that.
(Assignee)

Comment 60

9 years ago
The Mac orange was random, too. Now it appears that none of the orange is caused by this patch.
I'm warming up to this patch like a snake on a Texas highway -- I hope a big semi doesn't run me over.

This can land once reviewed. We want the Java code too, hosted sensibly. Under parser/html if you please -- DLL boundaries shouldn't stop us, we are the masters and they are the servants.

/be
>diff --git a/content/html/parser/src/nsHtml5Parser.cpp b/content/html/parser/src/nsHtml5Parser.cpp
>+nsHtml5Parser::~nsHtml5Parser()
>+{
>+  while (mFirstBuffer->next) {
>+    nsHtml5UTF16Buffer* oldBuf = mFirstBuffer;
>+    mFirstBuffer = mFirstBuffer->next;
>+    delete oldBuf;
>+  }
>+  delete mFirstBuffer;

while(mFirstBuffer) {
  nsHtml5UTF16Buffer* old = mFirstBuffer;
  mFirstBuffer = mFirstBuffer->next;
  delete old;
}

>+  delete mTokenizer;
>+  delete mTreeBuilder;
>+  if (mSniffingBuffer) {
>+    delete[] mSniffingBuffer;
>+  }

No need to do the nullcheck. delete on null is safe.

Can you turn these into nsAutoPtr/nsAutoArrayPtrs instead?

>+// copied from HTML content sink
>+static PRBool
>+IsScriptEnabled(nsIDocument *aDoc, nsIDocShell *aContainer)
>+{
>+  NS_ENSURE_TRUE(aDoc && aContainer, PR_TRUE);
>+
>+  nsCOMPtr<nsIScriptGlobalObject> globalObject = aDoc->GetScriptGlobalObject();
>+
>+  // Getting context is tricky if the document hasn't had its
>+  // GlobalObject set yet
>+  if (!globalObject) {
>+    nsCOMPtr<nsIScriptGlobalObjectOwner> owner = do_GetInterface(aContainer);
>+    NS_ENSURE_TRUE(owner, PR_TRUE);
>+
>+    globalObject = owner->GetScriptGlobalObject();
>+    NS_ENSURE_TRUE(globalObject, PR_TRUE);
>+  }

Hmm.. it's scary to have two different code paths here. Are you sure they can never yield different objects? When can you get into this with aDoc not having a scriptglobalobject?

At the very least we should ensure that aDoc and aContainer are each others container resp. document.

>+NS_IMETHODIMP_(void) 
>+nsHtml5Parser::GetCommand(nsCString& aCommand)
>+{
>+  aCommand.Assign("loadAsData");
>+}

Huh? Shouldn't this be "view"?

>+NS_IMETHODIMP_(void)
>+nsHtml5Parser::SetParserFilter(nsIParserFilter* aFilter)
>+{
>+  NS_ASSERTION(PR_TRUE, "Attempt to set a parser filter on HTML5 parser.");
>+}

Use NS_NOTREACHED or NS_ERROR instead.

>+NS_IMETHODIMP
>+nsHtml5Parser::GetChannel(nsIChannel** aChannel)
>+{
>+  nsresult result = NS_ERROR_NOT_AVAILABLE;
>+  if (mRequest) {
>+    result = CallQueryInterface(mRequest, aChannel);
>+  }
>+  return result;
>+}

return mRequest ? CallQueryInterface(mRequest, aChannel) :
                  NS_ERROR_NOT_AVAILABLE;

>+nsHtml5Parser::BlockParser()
>+{
>+  NS_PRECONDITION((!mFragmentMode), "Must not block in fragment mode.");
>+  mBlocked = PR_TRUE;
>+}

No need for the parenthesis here. That's a pattern you're using in a lot of assertions/preconditions

>+nsHtml5Parser::Parse(nsIURI* aURL, // legacy parameter; ignored
>+                     nsIRequestObserver* aObserver,
>+                     void* aKey, 
>+                     nsDTDMode aMode) // legacy; ignored
>+{
>+  mObserver = aObserver;
>+  mRootContextKey = aKey;
>+  mCanInterruptParser = PR_TRUE;
>+  NS_ASSERTION((mLifeCycle == NOT_STARTED), "Tried to start parse without initializing the parser properly.");
>+  return NS_OK;
>+}

Seems like the assertion should be a precondition.

>+NS_IMETHODIMP
>+nsHtml5Parser::Parse(const nsAString& aSourceBuffer,
>+                     void* aKey,
>+                     const nsACString& aContentType, // ignored
>+                     PRBool aLastCall,
>+                     nsDTDMode aMode) // ignored
>+{
>+  NS_PRECONDITION((!mFragmentMode), "Document.write called in fragment mode!");
>+  // Return early if the parser has processed EOF
>+  switch (mLifeCycle) {
>+    case TERMINATED:
>+      return NS_OK;
>+    case NOT_STARTED:
>+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
>+      mTokenizer->start();
>+      mLifeCycle = PARSING;
>+      mParser = this;
>+      mCharsetSource = kCharsetFromOtherComponent;
>+      break;
>+    default:
>+      break;
>+  }
>+  
>+  if (aLastCall && aSourceBuffer.IsEmpty() && aKey == GetRootContextKey()) {
>+    // document.close()
>+    mLifeCycle = STREAM_ENDING;
>+    MaybePostContinueEvent();
>+    return NS_OK;
>+  }

Shouldn't this ensure to fire DOMContentLoaded and similar events?

>+  // XXX stop speculative script thread here
>+  
>+  PRInt32 lineNumberSave = mTokenizer->getLineNumber();
>+  
>+  // Maintain a reference to ourselves so we don't go away
>+  // till we're completely done.
>+  // XXX is this still necessary? -- hsivonen
>+  nsCOMPtr<nsIParser> kungFuDeathGrip(this);
>+  
>+  if (!aSourceBuffer.IsEmpty()) {
>+    nsHtml5UTF16Buffer* buffer = new nsHtml5UTF16Buffer(aSourceBuffer.Length());
>+    memcpy(buffer->getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar));
>+    buffer->setEnd(aSourceBuffer.Length());
>+    if (!mBlocked) {
>+      WillResumeImpl();
>+      WillParseImpl();
>+      while (buffer->hasMore()) {
>+        buffer->adjust(mLastWasCR);
>+        mLastWasCR = PR_FALSE;
>+        if (buffer->hasMore()) {
>+          mLastWasCR = mTokenizer->tokenizeBuffer(buffer);
>+          if (mScriptElement) {
>+            // tree builder will have flushed
>+            ExecuteScript();
>+          }
>+          if (mNeedsCharsetSwitch) {
>+            // XXX setup immediate reparse
>+            delete buffer;
>+            WillInterruptImpl();
>+            return NS_OK;

Can this really happen during a document.write?? Or rather, should it really be honored then?

>+          } else if (mBlocked) {
>+            // XXX is the tail insertion and script exec in the wrong order?
>+            WillInterruptImpl();
>+            break;
>+          } else {
>+            // Ignore suspensions
>+            continue;
>+          }
>+        }
>+      }
>+    }
>+    if (buffer->hasMore()) {
>+      nsHtml5UTF16Buffer* prevSearchBuf = nsnull;
>+      nsHtml5UTF16Buffer* searchBuf = mFirstBuffer;
>+      if (aKey) { // after document.open, the first level of document.write has null key
>+        while (searchBuf != mLastBuffer) {
>+          if (searchBuf->key == aKey) {
>+            buffer->next = searchBuf;
>+            if (prevSearchBuf) {
>+              prevSearchBuf->next = buffer;
>+            } else {
>+              mFirstBuffer = buffer;
>+            }
>+            break;
>+          }
>+          prevSearchBuf = searchBuf;
>+          searchBuf = searchBuf->next;
>+        }
>+      }
>+      if (searchBuf == mLastBuffer || !aKey) {
>+        // key was not found or we have a first-level write after document.open
>+        // we'll insert to the head of the queue
>+        nsHtml5UTF16Buffer* keyHolder = new nsHtml5UTF16Buffer(aKey);
>+        keyHolder->next = mFirstBuffer;
>+        buffer->next = keyHolder;
>+        mFirstBuffer = buffer;
>+      }
>+      MaybePostContinueEvent();

Could you add a comment describing what the above is trying to do. I take it this is trying to deal with nested document.writes which is quite a complicated algorithm to wrap your head around, so comments would definitely help.

>+nsHtml5Parser::Terminate(void)
...
>+  PRBool ready = ReadyToCallDidBuildModelImpl(PR_TRUE);
>+  NS_ASSERTION(ready, "Should always be ready to call DidBuildModel here.");
>+  return DidBuildModel(); // nsIContentSink
>+}

You need to make the declaration of |ready| be #ifdef DEBUG, otherwise non-debug builds will warn about an unused variable.

>+NS_IMETHODIMP
>+nsHtml5Parser::ParseFragment(const nsAString& aSourceBuffer,
>+                             nsISupports* aTargetNode,
>+                             nsIAtom* aContextLocalName,
>+                             PRInt32 aContextNamespace,
>+                             PRBool aQuirks)
>+{
>+  nsCOMPtr<nsIContent> target = do_QueryInterface(aTargetNode);
>+  NS_ASSERTION(target, "Target did not QI to nsIContent");
>+  mTreeBuilder->setFragmentContext(aContextLocalName, aContextNamespace, target, aQuirks);
>+  mFragmentMode = PR_TRUE;
>+  mCanInterruptParser = PR_FALSE;
>+  NS_ASSERTION((mLifeCycle == NOT_STARTED), "Tried to start parse without initializing the parser properly.");
>+  mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));

Does this matter for fragment parsing? Is it only for handling of <noscript> elements?

>+  mTokenizer->start();
>+  mLifeCycle = PARSING;
>+  mParser = this;
>+  mNodeInfoManager = target->GetOwnerDoc()->NodeInfoManager();
>+
>+  if (!aSourceBuffer.IsEmpty()) {
>+    PRBool lastWasCR = PR_FALSE;
>+    nsHtml5UTF16Buffer buffer(aSourceBuffer.Length());
>+    memcpy(buffer.getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar));
>+    buffer.setEnd(aSourceBuffer.Length());
>+    while (buffer.hasMore()) {
>+      buffer.adjust(lastWasCR);
>+      lastWasCR = PR_FALSE;
>+      if (buffer.hasMore()) {
>+        lastWasCR = mTokenizer->tokenizeBuffer(&buffer);
>+        if (mScriptElement) {
>+          nsCOMPtr<nsIScriptElement> script = do_QueryInterface(mScriptElement);
>+          NS_ASSERTION(script, "mScriptElement didn't QI to nsIScriptElement!");
>+          script->PreventExecution();
>+          mScriptElement = nsnull;

Can't you deal with this when creating the script elements instead? To avoid having to break out of the tokenizer on every script element during innerHTML parsing.

I guess it doesn't make that big of a difference perf-wise?

>+/**
>+ *  This is called by the networking library once the last block of data
>+ *  has been collected from the net.
>+ */
>+nsresult
>+nsHtml5Parser::OnStopRequest(nsIRequest* aRequest, nsISupports* aContext,
>+                        nsresult status)

Fix indentation

>+{
>+  mTreeBuilder->MaybeFlush();
>+  NS_ASSERTION((mRequest == aRequest), "Got Stop on wrong stream.");
>+  nsresult rv = NS_OK;
>+  
>+  if (!mUnicodeDecoder) {
>+    PRUint32 writeCount;
>+    rv = FinalizeSniffing(nsnull, 0, &writeCount, 0);
>+  }
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  
>+  switch (mLifeCycle) {
>+    case TERMINATED:
>+      break;
>+    case NOT_STARTED:
>+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
>+      mTokenizer->start();
>+      mLifeCycle = STREAM_ENDING;
>+      mParser = this;
>+      break;
>+    case STREAM_ENDING:
>+      NS_ERROR("OnStopRequest when the stream lifecycle was already ending.");
>+      break;
>+    default:
>+      mLifeCycle = STREAM_ENDING;
>+      break;
>+  }
>+
>+//  if (eOnStart == mStreamListenerState) {
>+    // If you're here, then OnDataAvailable() never got called.  Prior
>+    // to necko, we never dealt with this case, but the problem may
>+    // have existed.  Everybody can live with an empty input stream, so
>+    // just resume parsing.
>+//    ParseUntilSuspend();
>+//  }
>+//  mStreamStatus = status;
>+
>+//  if (mParserFilter)
>+//    mParserFilter->Finish();

What's all this commented out code? And is that comment as old as it seems? "prior to necko"??

>+  mStreamListenerState = eOnStop;
>+
>+  if (!mScriptsExecuting) {
>+    ParseUntilSuspend();
>+  }
>+
>+  // If the parser isn't enabled, we don't finish parsing till
>+  // it is reenabled.

This comment doesn't seem to correspond to any code. Unless it's meant to refer to the code right before it?
(Assignee)

Comment 63

9 years ago
(In reply to comment #62)
> >diff --git a/content/html/parser/src/nsHtml5Parser.cpp b/content/html/parser/src/nsHtml5Parser.cpp
> >+nsHtml5Parser::~nsHtml5Parser()
> >+{
> >+  while (mFirstBuffer->next) {
> >+    nsHtml5UTF16Buffer* oldBuf = mFirstBuffer;
> >+    mFirstBuffer = mFirstBuffer->next;
> >+    delete oldBuf;
> >+  }
> >+  delete mFirstBuffer;
> 
> while(mFirstBuffer) {
>   nsHtml5UTF16Buffer* old = mFirstBuffer;
>   mFirstBuffer = mFirstBuffer->next;
>   delete old;
> }

Fixed.

> >+  delete mTokenizer;
> >+  delete mTreeBuilder;
> >+  if (mSniffingBuffer) {
> >+    delete[] mSniffingBuffer;
> >+  }
> 
> No need to do the nullcheck. delete on null is safe.

Fixed.

> Can you turn these into nsAutoPtr/nsAutoArrayPtrs instead?

That seems like an overkill, because these objects are only ever held from one place. If we want to avoid manual new/delete here, mTokenizer and mTreeBuilder could, in principle, be allocated as parts of the nsHtml5Parser object instead of hitting new/delete and having them behind a pointer.

> >+// copied from HTML content sink
> >+static PRBool
> >+IsScriptEnabled(nsIDocument *aDoc, nsIDocShell *aContainer)
> >+{
> >+  NS_ENSURE_TRUE(aDoc && aContainer, PR_TRUE);
> >+
> >+  nsCOMPtr<nsIScriptGlobalObject> globalObject = aDoc->GetScriptGlobalObject();
> >+
> >+  // Getting context is tricky if the document hasn't had its
> >+  // GlobalObject set yet
> >+  if (!globalObject) {
> >+    nsCOMPtr<nsIScriptGlobalObjectOwner> owner = do_GetInterface(aContainer);
> >+    NS_ENSURE_TRUE(owner, PR_TRUE);
> >+
> >+    globalObject = owner->GetScriptGlobalObject();
> >+    NS_ENSURE_TRUE(globalObject, PR_TRUE);
> >+  }
> 
> Hmm.. it's scary to have two different code paths here. Are you sure they can
> never yield different objects? When can you get into this with aDoc not having
> a scriptglobalobject?

I have no idea. CVS blame says this code was put in by pollman for bug 77390 (r=harishd, sr=jst). I simply cargo-cult copied that code from the old parser. In fact, there's more in the script execution choreography that involves undocumented implicit important API contracts that I don't understand precisely. So far, things have failed when I've tried to streamline things instead of copying the old stuff exactly. 

It'll be "interesting" to verify that the script execution choreography actually is black-box equivalent to what HTML5 says and figuring out which side needs to change and how if they don't. (Bug 482913)

> At the very least we should ensure that aDoc and aContainer are each others
> container resp. document.

What's the right way to do this? (I don't see an obvious API for querying the containment directly. Do I need to walk the docshelltreeitems manually, etc.?)

> >+NS_IMETHODIMP_(void) 
> >+nsHtml5Parser::GetCommand(nsCString& aCommand)
> >+{
> >+  aCommand.Assign("loadAsData");
> >+}
> 
> Huh? Shouldn't this be "view"?

I guess it should. Fixed.

> >+NS_IMETHODIMP_(void)
> >+nsHtml5Parser::SetParserFilter(nsIParserFilter* aFilter)
> >+{
> >+  NS_ASSERTION(PR_TRUE, "Attempt to set a parser filter on HTML5 parser.");
> >+}
> 
> Use NS_NOTREACHED or NS_ERROR instead.

Used NS_ERROR.

> >+NS_IMETHODIMP
> >+nsHtml5Parser::GetChannel(nsIChannel** aChannel)
> >+{
> >+  nsresult result = NS_ERROR_NOT_AVAILABLE;
> >+  if (mRequest) {
> >+    result = CallQueryInterface(mRequest, aChannel);
> >+  }
> >+  return result;
> >+}
> 
> return mRequest ? CallQueryInterface(mRequest, aChannel) :
>                   NS_ERROR_NOT_AVAILABLE;

Copied from nsParser, but fixed. :-)

> >+nsHtml5Parser::BlockParser()
> >+{
> >+  NS_PRECONDITION((!mFragmentMode), "Must not block in fragment mode.");
> >+  mBlocked = PR_TRUE;
> >+}
> 
> No need for the parenthesis here. That's a pattern you're using in a lot of
> assertions/preconditions

Fixed here. Should I also remove parentheses in cases like NS_ASSERTION((foo == bar), "blah");?

> >+nsHtml5Parser::Parse(nsIURI* aURL, // legacy parameter; ignored
> >+                     nsIRequestObserver* aObserver,
> >+                     void* aKey, 
> >+                     nsDTDMode aMode) // legacy; ignored
> >+{
> >+  mObserver = aObserver;
> >+  mRootContextKey = aKey;
> >+  mCanInterruptParser = PR_TRUE;
> >+  NS_ASSERTION((mLifeCycle == NOT_STARTED), "Tried to start parse without initializing the parser properly.");
> >+  return NS_OK;
> >+}
> 
> Seems like the assertion should be a precondition.

Fixed.

> >+NS_IMETHODIMP
> >+nsHtml5Parser::Parse(const nsAString& aSourceBuffer,
> >+                     void* aKey,
> >+                     const nsACString& aContentType, // ignored
> >+                     PRBool aLastCall,
> >+                     nsDTDMode aMode) // ignored
> >+{
> >+  NS_PRECONDITION((!mFragmentMode), "Document.write called in fragment mode!");
> >+  // Return early if the parser has processed EOF
> >+  switch (mLifeCycle) {
> >+    case TERMINATED:
> >+      return NS_OK;
> >+    case NOT_STARTED:
> >+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
> >+      mTokenizer->start();
> >+      mLifeCycle = PARSING;
> >+      mParser = this;
> >+      mCharsetSource = kCharsetFromOtherComponent;
> >+      break;
> >+    default:
> >+      break;
> >+  }
> >+  
> >+  if (aLastCall && aSourceBuffer.IsEmpty() && aKey == GetRootContextKey()) {
> >+    // document.close()
> >+    mLifeCycle = STREAM_ENDING;
> >+    MaybePostContinueEvent();
> >+    return NS_OK;
> >+  }
> 
> Shouldn't this ensure to fire DOMContentLoaded and similar events?

No, there may still be unparsed data in the buffer chain if the caller did:
document.write("<script></script>trailing data");
document.close();

Even without trailing data, posting a continue event ensures that onload is fired from the event loop rather than from within a script.

> >+  // XXX stop speculative script thread here
> >+  
> >+  PRInt32 lineNumberSave = mTokenizer->getLineNumber();
> >+  
> >+  // Maintain a reference to ourselves so we don't go away
> >+  // till we're completely done.
> >+  // XXX is this still necessary? -- hsivonen
> >+  nsCOMPtr<nsIParser> kungFuDeathGrip(this);
> >+  
> >+  if (!aSourceBuffer.IsEmpty()) {
> >+    nsHtml5UTF16Buffer* buffer = new nsHtml5UTF16Buffer(aSourceBuffer.Length());
> >+    memcpy(buffer->getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar));
> >+    buffer->setEnd(aSourceBuffer.Length());
> >+    if (!mBlocked) {
> >+      WillResumeImpl();
> >+      WillParseImpl();
> >+      while (buffer->hasMore()) {
> >+        buffer->adjust(mLastWasCR);
> >+        mLastWasCR = PR_FALSE;
> >+        if (buffer->hasMore()) {
> >+          mLastWasCR = mTokenizer->tokenizeBuffer(buffer);
> >+          if (mScriptElement) {
> >+            // tree builder will have flushed
> >+            ExecuteScript();
> >+          }
> >+          if (mNeedsCharsetSwitch) {
> >+            // XXX setup immediate reparse
> >+            delete buffer;
> >+            WillInterruptImpl();
> >+            return NS_OK;
> 
> Can this really happen during a document.write?? Or rather, should it really be
> honored then?

I don't see why it wouldn't be honored per spec. Opera honors it. IE8, Gecko's old parser and WebKit don't. Maybe the spec should change here?

Demo:
http://hsivonen.iki.fi/test/moz/iso-8859-2-in-doc-write.htm

> >+          } else if (mBlocked) {
> >+            // XXX is the tail insertion and script exec in the wrong order?
> >+            WillInterruptImpl();
> >+            break;
> >+          } else {
> >+            // Ignore suspensions
> >+            continue;
> >+          }
> >+        }
> >+      }
> >+    }
> >+    if (buffer->hasMore()) {
> >+      nsHtml5UTF16Buffer* prevSearchBuf = nsnull;
> >+      nsHtml5UTF16Buffer* searchBuf = mFirstBuffer;
> >+      if (aKey) { // after document.open, the first level of document.write has null key
> >+        while (searchBuf != mLastBuffer) {
> >+          if (searchBuf->key == aKey) {
> >+            buffer->next = searchBuf;
> >+            if (prevSearchBuf) {
> >+              prevSearchBuf->next = buffer;
> >+            } else {
> >+              mFirstBuffer = buffer;
> >+            }
> >+            break;
> >+          }
> >+          prevSearchBuf = searchBuf;
> >+          searchBuf = searchBuf->next;
> >+        }
> >+      }
> >+      if (searchBuf == mLastBuffer || !aKey) {
> >+        // key was not found or we have a first-level write after document.open
> >+        // we'll insert to the head of the queue
> >+        nsHtml5UTF16Buffer* keyHolder = new nsHtml5UTF16Buffer(aKey);
> >+        keyHolder->next = mFirstBuffer;
> >+        buffer->next = keyHolder;
> >+        mFirstBuffer = buffer;
> >+      }
> >+      MaybePostContinueEvent();
> 
> Could you add a comment describing what the above is trying to do. I take it
> this is trying to deal with nested document.writes which is quite a complicated
> algorithm to wrap your head around, so comments would definitely help.

Comments added.

> >+nsHtml5Parser::Terminate(void)
> ...
> >+  PRBool ready = ReadyToCallDidBuildModelImpl(PR_TRUE);
> >+  NS_ASSERTION(ready, "Should always be ready to call DidBuildModel here.");
> >+  return DidBuildModel(); // nsIContentSink
> >+}
> 
> You need to make the declaration of |ready| be #ifdef DEBUG, otherwise
> non-debug builds will warn about an unused variable.

Fixed.

> >+NS_IMETHODIMP
> >+nsHtml5Parser::ParseFragment(const nsAString& aSourceBuffer,
> >+                             nsISupports* aTargetNode,
> >+                             nsIAtom* aContextLocalName,
> >+                             PRInt32 aContextNamespace,
> >+                             PRBool aQuirks)
> >+{
> >+  nsCOMPtr<nsIContent> target = do_QueryInterface(aTargetNode);
> >+  NS_ASSERTION(target, "Target did not QI to nsIContent");
> >+  mTreeBuilder->setFragmentContext(aContextLocalName, aContextNamespace, target, aQuirks);
> >+  mFragmentMode = PR_TRUE;
> >+  mCanInterruptParser = PR_FALSE;
> >+  NS_ASSERTION((mLifeCycle == NOT_STARTED), "Tried to start parse without initializing the parser properly.");
> >+  mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
> 
> Does this matter for fragment parsing? Is it only for handling of <noscript>
> elements?

It's only for handling <noscript> elements. I'm not sure if the code can be reached if scripting is off. I assumed that an extension running as chrome could reach it even if the document has scripting turned off, so I put it in just in case.

> >+  mTokenizer->start();
> >+  mLifeCycle = PARSING;
> >+  mParser = this;
> >+  mNodeInfoManager = target->GetOwnerDoc()->NodeInfoManager();
> >+
> >+  if (!aSourceBuffer.IsEmpty()) {
> >+    PRBool lastWasCR = PR_FALSE;
> >+    nsHtml5UTF16Buffer buffer(aSourceBuffer.Length());
> >+    memcpy(buffer.getBuffer(), aSourceBuffer.BeginReading(), aSourceBuffer.Length() * sizeof(PRUnichar));
> >+    buffer.setEnd(aSourceBuffer.Length());
> >+    while (buffer.hasMore()) {
> >+      buffer.adjust(lastWasCR);
> >+      lastWasCR = PR_FALSE;
> >+      if (buffer.hasMore()) {
> >+        lastWasCR = mTokenizer->tokenizeBuffer(&buffer);
> >+        if (mScriptElement) {
> >+          nsCOMPtr<nsIScriptElement> script = do_QueryInterface(mScriptElement);
> >+          NS_ASSERTION(script, "mScriptElement didn't QI to nsIScriptElement!");
> >+          script->PreventExecution();
> >+          mScriptElement = nsnull;
> 
> Can't you deal with this when creating the script elements instead? To avoid
> having to break out of the tokenizer on every script element during innerHTML
> parsing.

I could, but then the non-fragment code path would run an additional branch to check if it is on the fragment path. It seems more consistent to run the same tree builder code and put the special case here.

> I guess it doesn't make that big of a difference perf-wise?

I'd expect this not to be a perf problem as written, because it only happens on </script> in innerHTML, which probably isn't too common. Also, this kind of falling of the fast track already happens for every carriage return.

> >+/**
> >+ *  This is called by the networking library once the last block of data
> >+ *  has been collected from the net.
> >+ */
> >+nsresult
> >+nsHtml5Parser::OnStopRequest(nsIRequest* aRequest, nsISupports* aContext,
> >+                        nsresult status)
> 
> Fix indentation

Fixed.

> >+{
> >+  mTreeBuilder->MaybeFlush();
> >+  NS_ASSERTION((mRequest == aRequest), "Got Stop on wrong stream.");
> >+  nsresult rv = NS_OK;
> >+  
> >+  if (!mUnicodeDecoder) {
> >+    PRUint32 writeCount;
> >+    rv = FinalizeSniffing(nsnull, 0, &writeCount, 0);
> >+  }
> >+  NS_ENSURE_SUCCESS(rv, rv);
> >+  
> >+  switch (mLifeCycle) {
> >+    case TERMINATED:
> >+      break;
> >+    case NOT_STARTED:
> >+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
> >+      mTokenizer->start();
> >+      mLifeCycle = STREAM_ENDING;
> >+      mParser = this;
> >+      break;
> >+    case STREAM_ENDING:
> >+      NS_ERROR("OnStopRequest when the stream lifecycle was already ending.");
> >+      break;
> >+    default:
> >+      mLifeCycle = STREAM_ENDING;
> >+      break;
> >+  }
> >+
> >+//  if (eOnStart == mStreamListenerState) {
> >+    // If you're here, then OnDataAvailable() never got called.  Prior
> >+    // to necko, we never dealt with this case, but the problem may
> >+    // have existed.  Everybody can live with an empty input stream, so
> >+    // just resume parsing.
> >+//    ParseUntilSuspend();
> >+//  }
> >+//  mStreamStatus = status;
> >+
> >+//  if (mParserFilter)
> >+//    mParserFilter->Finish();
> 
> What's all this commented out code? And is that comment as old as it seems?
> "prior to necko"??

It's code I copied from nsParser to use as a starting point. Removed.

> >+  mStreamListenerState = eOnStop;
> >+
> >+  if (!mScriptsExecuting) {
> >+    ParseUntilSuspend();
> >+  }
> >+
> >+  // If the parser isn't enabled, we don't finish parsing till
> >+  // it is reenabled.
> 
> This comment doesn't seem to correspond to any code. Unless it's meant to refer
> to the code right before it?

Left-over from old nsParser code. Removed.
(Assignee)

Comment 64

9 years ago
Created attachment 384830 [details] [diff] [review]
Address review comments, fix crash with app cache, fix a test case
Attachment #384287 - Attachment is obsolete: true
Attachment #384830 - Flags: superreview?(jonas)
Attachment #384830 - Flags: review?(mrbkap)
Attachment #384287 - Flags: superreview?(jonas)
Attachment #384287 - Flags: review?(mrbkap)
(Assignee)

Comment 65

9 years ago
Created attachment 384831 [details] [diff] [review]
Interdiff of the previous and current patch
(I'll respond since these comments cover most of mine and because sicking is probably already asleep).

(In reply to comment #63)
> > Can you turn these into nsAutoPtr/nsAutoArrayPtrs instead?
> 
> That seems like an overkill, because these objects are only ever held from one
> place. If we want to avoid manual new/delete here, mTokenizer and mTreeBuilder
> could, in principle, be allocated as parts of the nsHtml5Parser object instead
> of hitting new/delete and having them behind a pointer.

If you can get away with that, I think that would be preferable. At the very least, I agree with sicking here -- especially with the Reset API on nsIParser, the potential for accidental leaks seems high.

FWIW, in hand-written C++ code, the prevailing style for the past few years (as long as I've been working on Gecko) has been to eschew manual refcounting in favor of nsRefPtr/nsAutoPtr with the assumption that compilers will generate decent code (and that humans forget to do stuff like that way more easily than compilers!).

> Copied from nsParser, but fixed. :-)

rickg = single entrance single exit. :-)

> Fixed here. Should I also remove parentheses in cases like NS_ASSERTION((foo ==
> bar), "blah");?

Yeah. In cases like this, it's up to the macro to do proper parenthesizing. Also, the parentheses will be distracting in the console if the asserts do hit.

> > Can this really happen during a document.write?? Or rather, should it really be
> > honored then?
> 
> I don't see why it wouldn't be honored per spec. Opera honors it. IE8, Gecko's
> old parser and WebKit don't. Maybe the spec should change here?

For what it's worth, we have code in the tree specifically to not honor this. I haven't done the CVS archeology yet to figure out why, though.

> It's only for handling <noscript> elements. I'm not sure if the code can be
> reached if scripting is off. I assumed that an extension running as chrome
> could reach it even if the document has scripting turned off, so I put it in
> just in case.

I'd expect this to run copy/pasting HTML into a noscript document, but I haven't tested.

> It's code I copied from nsParser to use as a starting point. Removed.

Blame me for not nuking it earlier (I did finally get rid of it in nsParser.cpp recently).
(Assignee)

Comment 67

9 years ago
Created attachment 384864 [details] [diff] [review]
Interdiff for making static jArray init safe, fix assert parens, include .java

I included .java files, and now the full patch is 2366 KB. Over the limit reed set. :-(
(Assignee)

Comment 68

9 years ago
(In reply to comment #47)
> There's a number of cases like this:
> 
> static jArray<PRUnichar,PRInt32> SCRIPT_ARR;
> 
> Unfortunately compilers aren't reliable enough at initializing static objects,
> so this won't work cross all platforms. Static variables that don't require
> constructors (such as a pointer to an array) is ok though.

Fixed.

(In reply to comment #61)
> This can land once reviewed. We want the Java code too, hosted sensibly.

Minimal .java files now added as javasrc/*.java.

> Under
> parser/html if you please -- DLL boundaries shouldn't stop us, we are the
> masters and they are the servants.

Maybe parser/html/ can be put into the layout DLL without moving parser/htmlparser/ for now. I'll try doing it that way.

(In reply to comment #66)
> (In reply to comment #63)
> > > Can you turn these into nsAutoPtr/nsAutoArrayPtrs instead?
> > 
> > That seems like an overkill, because these objects are only ever held from one
> > place. If we want to avoid manual new/delete here, mTokenizer and mTreeBuilder
> > could, in principle, be allocated as parts of the nsHtml5Parser object instead
> > of hitting new/delete and having them behind a pointer.
> 
> If you can get away with that, I think that would be preferable. At the very
> least, I agree with sicking here -- especially with the Reset API on nsIParser,
> the potential for accidental leaks seems high.

I got this almost done but didn't get it done all the way today. Still have an "incomplete type" error. I went for tree builder living inside tokenizer and tokenizer living inside parser.

> FWIW, in hand-written C++ code, the prevailing style for the past few years (as
> long as I've been working on Gecko) has been to eschew manual refcounting in
> favor of nsRefPtr/nsAutoPtr with the assumption that compilers will generate
> decent code (and that humans forget to do stuff like that way more easily than
> compilers!).

Going forward, I think refcounting stack nodes should go away and refcounting element nodes and atoms in the generated code should probably migrate to smart pointers.

> > Fixed here. Should I also remove parentheses in cases like NS_ASSERTION((foo ==
> > bar), "blah");?
> 
> Yeah. In cases like this, it's up to the macro to do proper parenthesizing.
> Also, the parentheses will be distracting in the console if the asserts do hit.

Fixed.
> > > Can this really happen during a document.write?? Or rather, should it
> > > really be honored then?
> > 
> > I don't see why it wouldn't be honored per spec. Opera honors it. IE8,
> > Gecko's old parser and WebKit don't. Maybe the spec should change here?
> 
> For what it's worth, we have code in the tree specifically to not honor this.
> I haven't done the CVS archeology yet to figure out why, though.

The case where it mostly seems weird to me to honor <meta>s in document.write is when someone is document.writing a whole document. In that case <meta> charsets makes no sense at all.

Given that most implementations are consistent here, and we do it on purpose, I'd say the spec should change yes.
>+// nsIContentSink
>+nsresult
>+nsHtml5Parser::WriteSniffingBufferAndCurrentSegment(const PRUint8* aFromSegment, // can be null
>+                                                    PRUint32 aCount,
>+                                                    PRUint32* aWriteCount)
>+{
>+  nsresult rv = NS_OK;
>+  if (mSniffingBuffer) {
>+    PRUint32 writeCount;
>+    rv = WriteStreamBytes(mSniffingBuffer, mSniffingLength, &writeCount);
>+    NS_ENSURE_SUCCESS(rv, rv);
>+    delete[] mSniffingBuffer;
>+    mSniffingBuffer = nsnull;
>+  }
>+  if (mMetaScanner) {
>+    delete mMetaScanner;
>+    mMetaScanner = nsnull;
>+  }

mSniffingBuffer and mMetaScanner looks like they can be nsAuto(Array)Ptrs too.

>+nsHtml5Parser::SniffStreamBytes(const PRUint8* aFromSegment,
>+                                PRUint32 aCount,
>+                                PRUint32* aWriteCount)
>+{
>+  nsresult rv = NS_OK;
>+  PRUint32 writeCount;
>+  for (PRUint32 i = 0; i < aCount; i++) {
>+    switch (mBomState) {
...
>+  }

I'm pretty sure we have BOM detecting code elsewhere in our code, could you perhaps reuse that? Not a huge deal I guess.

>+void
>+nsHtml5Parser::ParseUntilSuspend()
>+{
>+  NS_PRECONDITION((!mNeedsCharsetSwitch), "ParseUntilSuspend called when charset switch needed.");
>+  if (mBlocked) {
>+    return;
>+  }
>+  switch (mLifeCycle) {
>+    case TERMINATED:
>+      return;
>+    case NOT_STARTED:
>+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
>+      mTokenizer->start();
>+      mLifeCycle = PARSING;
>+      mParser = this;
>+      break;
>+    default:
>+      break;
>+  }
>+  
>+  WillResumeImpl();
>+  WillParseImpl();
>+  
>+  mSuspending = PR_FALSE;
>+  for (;;) {
>+    if (!mFirstBuffer->hasMore()) {
>+      if (mFirstBuffer == mLastBuffer) {
>+        switch (mLifeCycle) {
>+          case PARSING:
>+            // never release the last buffer. instead just zero its indeces for refill
>+            mFirstBuffer->setStart(0);
>+            mFirstBuffer->setEnd(0);
>+            return; // no more data for now but expecting more
>+          case STREAM_ENDING:
>+            if (ReadyToCallDidBuildModelImpl(PR_FALSE)) {
>+              DidBuildModel();            
>+            } else {
>+              MaybePostContinueEvent();
>+            }

Won't this cause us to post continue events over and over (consuming 100% CPU) until ReadyToCallDidBuildModelImpl returns true (i.e. until all scripts have finished loading).

IIRC the current code instead continues parsing once the last script is doing executing. See
http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsContentSink.cpp#399

Testcases were submitted in bug 461555.


>+    if (mFirstBuffer->hasMore()) {
>+      mLastWasCR = mTokenizer->tokenizeBuffer(mFirstBuffer);
>+      if (mScriptElement) {
>+        ExecuteScript();
>+      }
>+      if (mNeedsCharsetSwitch) {
>+        if (PerformCharsetSwitch() == NS_ERROR_HTMLPARSER_STOPPARSING) {
>+          return;        
>+        } else {
>+          // let's continue if we failed to restart
>+          mNeedsCharsetSwitch = PR_FALSE;
>+        }
>+      }

Can both mScriptElement and mNeedsCharsetSwitch be set? If not, might be worth asserting that.

Also, should you perhaps do |continue;| after calling ExecuteScript()? Feels like all sorts of state can be caused by the script executing.

>+void
>+nsHtml5Parser::ExecuteScript()
>+{
>+  NS_PRECONDITION(mScriptElement, "Trying to run a script without having one!");
>+#ifdef DEBUG
>+  if (!mScriptsExecuting) {
>+    NS_ASSERTION(!mSnapshot, "Already had a state snapshot");
>+    mSnapshot = mTreeBuilder->newSnapshot();
>+  }
>+#endif

Might be worth using #ifdef GATHER_DOCWRITE_STATISTICS or something like that instead. So it's clear what the code is for.

>+  nsCOMPtr<nsIScriptElement> sele = do_QueryInterface(mScriptElement);
>+ 
>+   // Notify our document that we're loading this script.
>+  nsCOMPtr<nsIHTMLDocument> htmlDocument = do_QueryInterface(mDocument);
>+  NS_ASSERTION(htmlDocument, "Document didn't QI into HTML document.");
>+  htmlDocument->ScriptLoading(sele);
>+ 
>+   // Copied from nsXMLContentSink
>+  
>+  // Now tell the script that it's ready to go. This may execute the script
>+  // or return NS_ERROR_HTMLPARSER_BLOCK. Or neither if the script doesn't
>+  // need executing.
>+  nsresult rv = mScriptElement->DoneAddingChildren(PR_TRUE);

So per current HTML spec this doesn't do exactly the right thing. If a parent of the script element has been removed from the document, the script won't execute here. However the HTML spec says it should. There was a recent thread about this on the HTML list.

>+nsHtml5Parser::MaybePostContinueEvent()
>+{
>+  NS_PRECONDITION((mLifeCycle != TERMINATED), "Tried to post continue event when the parser is done.");
>+  if (mContinueEvent) {
>+    return; // we already have a pending event
>+  }
>+#if 0
>+  if ((!mTreeBuilder->NeedsFlush()) && (mStreamListenerState == eOnStart || mStreamListenerState == eOnDataAvail)) {
>+    return; // we are expecting stream events and we don't need a quick flush for the tree builder
>+  }
>+#endif

Why the #if 0?

>+  // This creates a reference cycle between this and the event that is
>+  // broken when the event fires.
>+  nsCOMPtr<nsIRunnable> event = new nsHtml5ParserContinueEvent(this);
>+  if (NS_FAILED(NS_DispatchToCurrentThread(event))) {
>+      NS_WARNING("failed to dispatch parser continuation event");
>+  } else {
>+      mContinueEvent = event;
>+  }

Probably worth keeping the event object around and reuse it multiple times.

>+void
>+nsHtml5Parser::ScriptExecuting()
>+{
>+  ++mScriptsExecuting;
>+}
>+
>+void
>+nsHtml5Parser::ScriptDidExecute()
>+{
>+  NS_ASSERTION(mScriptsExecuting > 0, "Too many calls to ScriptDidExecute");
>+  --mScriptsExecuting;
>+}

I think this API was removed from trunk.

>+PRBool
>+nsHtml5Parser::CanInterrupt()
>+{
>+  return !mFragmentMode;
>+}

This should also return false during document.writes, no?

>diff --git a/content/html/parser/src/nsHtml5Parser.h b/content/html/parser/src/nsHtml5Parser.h
>+enum eHtml5ParserLifecycle {
>+  NOT_STARTED = 0,
>+  PARSING = 1,
>+  STREAM_ENDING = 2,
>+  TERMINATED = 3
>+};

Please document these

>+enum eBomState {
>+  BOM_SNIFFING_NOT_STARTED = 0,
>+  SEEN_UTF_16_LE_FIRST_BYTE = 1,
>+  SEEN_UTF_16_BE_FIRST_BYTE = 2,
>+  SEEN_UTF_8_FIRST_BYTE = 3,
>+  SEEN_UTF_8_SECOND_BYTE = 4,
>+  BOM_SNIFFING_OVER = 5
>+};

And these (though these are more self evident)

>+class nsHtml5Parser : public nsIParser,

Documentation in general is pretty random/non-existant here. For functions from nsIParser/nsIContentSink no comment is needed since comments should be existing in nsIParser.h/nsIContentSink.h. Though it's great to just mentions which ones have a useless/no-op implementation. And no need to carry forward useless @update comments.

For functions that are new, please write complete documentation about what the function does and what the various arguments are.

>+    /**
>+     * Tells the parser that a script is now executing. The only data we
>+     * should resume parsing for is document.written data. We'll deal with any
>+     * data that comes in over the network later.
>+     */
>+    virtual void ScriptExecuting();
>+
>+    /**
>+     * Tells the parser that the script is done executing. We should now
>+     * continue the regular parsing process.
>+     */
>+    virtual void ScriptDidExecute();

These two functions are no longer on nsIParser

>+    // EncodingDeclarationHandler
>+    
>+    void internalEncodingDeclaration(nsString* aEncoding);
>+    
>+    // DocumentModeHandler
>+
>+    void documentMode(nsHtml5DocumentMode m);

Document these

>+    // Non-inherited methods
>+    nsresult FinalizeSniffing(const PRUint8* aFromSegment,
>+                              PRUint32 aCount,
>+                              PRUint32* aWriteCount,
>+                              PRUint32 aCountToSniffingLimit);
>+    nsresult SetupDecodingAndWriteSniffingBufferAndCurrentSegment(const PRUint8* aFromSegment,
>+                                                                  PRUint32 aCount,
>+                                                                  PRUint32* aWriteCount);
>+    nsresult WriteSniffingBufferAndCurrentSegment(const PRUint8* aFromSegment,
>+                                                  PRUint32 aCount,
>+                                                  PRUint32* aWriteCount);
>+    nsresult SetupDecodingFromBom(const char* aCharsetName, 
>+                                  const char* aDecoderCharsetName);
>+    nsresult SniffStreamBytes(const PRUint8* aFromSegment,
>+                              PRUint32 aCount,
>+                              PRUint32* aWriteCount);
>+    nsresult WriteStreamBytes(const PRUint8* aFromSegment,
>+                              PRUint32 aCount,
>+                              PRUint32* aWriteCount);
>+    void Suspend();
>+    void SetScriptElement(nsIContent* aScript);
>+    void UpdateStyleSheet(nsIContent* aElement);
>+    // Getters and setters for fields from nsContentSink 
>+    nsIDocument* GetDocument() {
>+      return mDocument;
>+    }
>+    nsNodeInfoManager* GetNodeInfoManager() {
>+      return mNodeInfoManager;
>+    }
>+    nsIDocShell* GetDocShell() {
>+      return mDocShell;
>+    }
>+    PRBool HasDecoder() {
>+      return !!mUnicodeDecoder;
>+    }
>+  private:
>+    void ExecuteScript();
>+    void MaybePostContinueEvent();
>+    nsresult PerformCharsetSwitch();
>+    /**
>+     * Parse until pending data is exhausted or tree builder suspends
>+     */
>+    void ParseUntilSuspend();
>+    void Cleanup();

Definitely document all of these. That'll help a lot once we do a more thorough review.

>+  private:
>+    // State variables
>+    PRBool                       mNeedsCharsetSwitch;
>+    PRBool                       mLastWasCR;
>+    PRBool                       mTerminated;
>+    PRBool                       mLayoutStarted;
>+    PRBool                       mFragmentMode;
>+    PRBool                       mBlocked;
>+    PRBool                       mSuspending;
>+    eHtml5ParserLifecycle        mLifeCycle;
>+    eStreamState                 mStreamListenerState;
>+
>+    // script execution
>+    nsCOMPtr<nsIContent>         mScriptElement;
>+    PRUint32                     mScriptsExecuting;
>+     
>+    // Gecko integration
>+    void*                        mRootContextKey;
>+    nsCOMPtr<nsIRequest>         mRequest; 
>+    nsCOMPtr<nsIRequestObserver> mObserver;
>+    nsIRunnable*                 mContinueEvent;  // weak ref
>+ 
>+    // tree-related stuff
>+    nsIContent*                  mDocElement; // weak ref 
>+  
>+    // encoding-related stuff
>+    PRInt32                      mCharsetSource;
>+    nsCString                    mCharset;
>+    nsCString                    mPendingCharset;
>+    nsCOMPtr<nsIUnicodeDecoder>  mUnicodeDecoder;
>+    PRUint8*                     mSniffingBuffer;
>+    PRUint32                     mSniffingLength; // number of meaningful bytes in mSniffingBuffer
>+    eBomState                    mBomState;
>+    nsHtml5MetaScanner*          mMetaScanner;
>+        
>+    // Portable parser objects    
>+    nsHtml5UTF16Buffer*          mFirstBuffer; // manually managed strong ref
>+    nsHtml5UTF16Buffer*          mLastBuffer; // weak ref; always points to 
>+                      // a buffer of the size NS_HTML5_PARSER_READ_BUFFER_SIZE
>+    nsHtml5TreeBuilder*          mTreeBuilder; // manually managed strong ref
>+    nsHtml5Tokenizer*            mTokenizer; // manually managed strong ref
>+#ifdef DEBUG
>+    nsHtml5StateSnapshot*        mSnapshot;
>+    static PRUint32              sUnsafeDocWrites;
>+    static PRUint32              sTokenSafeDocWrites;
>+    static PRUint32              sTreeSafeDocWrites;
>+#endif

And document any of these that aren't obvious.

>+#endif // nsHtml5PendingNotification_h__
>\ No newline at end of file

Please add newline. I think this happens at other places in the patch.
(In reply to comment #63)
> (In reply to comment #62)
> > >+NS_IMETHODIMP
> > >+nsHtml5Parser::Parse(const nsAString& aSourceBuffer,
> > >+                     void* aKey,
> > >+                     const nsACString& aContentType, // ignored
> > >+                     PRBool aLastCall,
> > >+                     nsDTDMode aMode) // ignored
> > >+{
> > >+  NS_PRECONDITION((!mFragmentMode), "Document.write called in fragment mode!");
> > >+  // Return early if the parser has processed EOF
> > >+  switch (mLifeCycle) {
> > >+    case TERMINATED:
> > >+      return NS_OK;
> > >+    case NOT_STARTED:
> > >+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
> > >+      mTokenizer->start();
> > >+      mLifeCycle = PARSING;
> > >+      mParser = this;
> > >+      mCharsetSource = kCharsetFromOtherComponent;
> > >+      break;
> > >+    default:
> > >+      break;
> > >+  }
> > >+  
> > >+  if (aLastCall && aSourceBuffer.IsEmpty() && aKey == GetRootContextKey()) {
> > >+    // document.close()
> > >+    mLifeCycle = STREAM_ENDING;
> > >+    MaybePostContinueEvent();
> > >+    return NS_OK;
> > >+  }
> > 
> > Shouldn't this ensure to fire DOMContentLoaded and similar events?
> 
> No, there may still be unparsed data in the buffer chain if the caller did:
> document.write("<script></script>trailing data");
> document.close();
> 
> Even without trailing data, posting a continue event ensures that onload is
> fired from the event loop rather than from within a script.

This is a drive-by comment, and may not apply... Just posting an event may not be enough here since modal dialogs etc process events, so a nested document.write() that causes a continue event to be posted and then does alert("foo") could process that event from within the script.
>diff --git a/content/html/parser/src/nsHtml5Portability.cpp b/content/html/parser/src/nsHtml5Portability.cpp

>+nsHtml5Portability::newLocalNameFromBuffer(PRUnichar* buf, PRInt32 offset, PRInt32 length)
>+{
>+  // Optimization opportunity: make buf itself null-terminated
>+  PRUnichar* nullTerminated = new PRUnichar[length + 1];
>+  memcpy(nullTerminated,buf, length * sizeof(PRUnichar));
>+  nullTerminated[length] = 0;
>+  nsIAtom* rv = NS_NewAtom(nullTerminated);
>+  delete[] nullTerminated;
>+  return rv;
>+}

Are you forgetting to use the offset argument?

And a faster/simpler implementation is simply:

return NS_NewAtom(nsDependentSubstring(buf, length))

>+jArray<PRUnichar,PRInt32>
>+nsHtml5Portability::newCharArrayFromLocal(nsIAtom* local)
>+{
>+  nsAutoString temp;
>+  local->ToString(temp);
>+  PRInt32 len = temp.Length();
>+  jArray<PRUnichar,PRInt32> rv = jArray<PRUnichar,PRInt32>(len);
>+  memcpy(rv, temp.BeginReading(), len * sizeof(PRUnichar));
>+  return rv;  
>+}

This copies twice, should be doable to just copy once if you use CopyUTF16toUTF8 and nsIAtom::getUTF8String

>+PRBool
>+nsHtml5Portability::localEqualsBuffer(nsIAtom* local, PRUnichar* buf, PRInt32 offset, PRInt32 length)
>+{
>+  nsAutoString temp = nsAutoString(buf + offset, length);
>+  return local->Equals(temp);
>+}

local->Equals(nsDependentString(buf + offset, length));

>+PRBool
>+nsHtml5Portability::lowerCaseLiteralIsPrefixOfIgnoreAsciiCaseString(const char* lowerCaseLiteral, nsString* string)

lowerCaseLiteralIsPrefixOfStringIgnoreAsciiCase?

Please assert that lowerCaseLiteral is indeed lower case.

>+  if (!string) {
>+    return PR_FALSE;
>+  }
>+  const char* litPtr = lowerCaseLiteral;
>+  const PRUnichar* strPtr = string->BeginReading();
>+  const PRUnichar* end = string->EndReading();
>+  PRUnichar litChar;
>+  while ((litChar = (PRUnichar)*litPtr) && (strPtr < end)) {

Do you need the explicit cast here? It makes things much harder to read.

>+    PRUnichar strChar = *strPtr;
>+    if (strChar >= 'A' && strChar <= 'Z') {
>+      strChar += 0x20;
>+    }
>+    if (litChar != strChar) {
>+      return PR_FALSE;
>+    }
>+    ++litPtr;
>+    ++strPtr;
>+  }
>+  return PR_TRUE;
>+}

Won't this return true if string is a prefix of lowerCaseLiteral too? That seems surprising given the function name. I think you want to return strPtr != end at the end.

>+jArray<PRUnichar,PRInt32>
>+nsHtml5Portability::isIndexPrompt()
>+{
>+  // Yeah, this whole method is uncool
>+  char* literal = "This is a searchable index. Insert your search keywords here: ";

This should be translatable.
>diff --git a/content/html/parser/src/nsHtml5TreeBuilderCppSupplement.h b/content/html/parser/src/nsHtml5TreeBuilderCppSupplement.h
>+nsHtml5TreeBuilder::shallowClone(nsIContent* aElement)
>+{
>+  nsINode* clone;
>+  aElement->Clone(aElement->NodeInfo(), &clone);

ASSERT that clone->IsNodeOfType(nsINode::eCONTENT) returns true.

>+nsHtml5TreeBuilder::addAttributesToElement(nsIContent* aElement, nsHtml5HtmlAttributes* aAttributes)
>+{
>+  nsIContent* holder = createElement(kNameSpaceID_XHTML, nsHtml5Atoms::div, aAttributes);
>+  nsHtml5TreeOperation* treeOp = mOpQueue.AppendElement();
>+  // XXX if null, OOM!
>+  treeOp->Init(eTreeOpAddAttributes, holder, aElement);

Hum, this is very suboptimal. And we can't really do it off the main thread either. Couldn't you simply set the attributes on the real element for now? I guess I'm fine either way for now.

>+nsHtml5TreeBuilder::elementPopped(PRInt32 aNamespace, nsIAtom* aName, nsIContent* aElement)
>+{
>+  NS_ASSERTION((aNamespace == kNameSpaceID_XHTML || aNamespace == kNameSpaceID_SVG || aNamespace == kNameSpaceID_MathML), "Element isn't HTML, SVG or MathML!");
>+  NS_ASSERTION(aName, "Element doesn't have local name!");
>+  NS_ASSERTION(aElement, "No element!");
>+
>+  MaybeSuspend();  
>+  
>+  if (aNamespace == kNameSpaceID_MathML) {
>+    return;
>+  }  
>+  // we now have only SVG and HTML
>+  
>+  if (aName == nsHtml5Atoms::script) {
>+//    mConstrainSize = PR_TRUE; // XXX what is this?

mConstrainSize makes us not split text nodes into 4k chunks. Functionality-wise not needed since the scriptelement will join all textnodes if there's multiple. However it's unneccesary perf-wise to split the nodes up.

>+  if (aNamespace == kNameSpaceID_SVG) {
>+#if 0
>+    if (aElement->HasAttr(kNameSpaceID_None, nsHtml5Atoms::onload)) {
>+      Flush();
>+
>+      nsEvent event(PR_TRUE, NS_SVG_LOAD);
>+      event.eventStructType = NS_SVG_EVENT;
>+      event.flags |= NS_EVENT_FLAG_CANT_BUBBLE;
>+
>+      // Do we care about forcing presshell creation if it hasn't happened yet?
>+      // That is, should this code flush or something?  Does it really matter?
>+      // For that matter, do we really want to try getting the prescontext?  Does
>+      // this event ever want one?
>+      nsRefPtr<nsPresContext> ctx;
>+      nsCOMPtr<nsIPresShell> shell = mParser->GetDocument()->GetPrimaryShell();
>+      if (shell) {
>+        ctx = shell->GetPresContext();
>+      }
>+      nsEventDispatcher::Dispatch(aElement, ctx, &event);
>+    }
>+#endif

Need to fix this soon so we can get testing on SVG-in-HTML, though fine with initial landing without it.

>diff --git a/content/html/parser/src/nsHtml5TreeOperation.cpp b/content/html/parser/src/nsHtml5TreeOperation.cpp
>+nsHtml5TreeOperation::Perform(nsHtml5TreeBuilder* aBuilder)
>+    case eTreeOpAppendChildrenToNewParent: {
>+      aBuilder->FlushPendingAppendNotifications();
>+      PRUint32 childCount = mParent->GetChildCount();
>+      PRBool didAppend = PR_FALSE;
>+      while (mNode->GetChildCount()) {
>+        nsCOMPtr<nsIContent> child = mNode->GetChildAt(0);
>+        rv = mNode->RemoveChildAt(0, PR_TRUE);

Shouldn't you be adding |, PR_FALSE| here?
In general not all comments needs to be fixed before initial landing. Feel free to email or grab me on irc if there's something you'd prefer to fix later.

In general the most important thing is having the files live in the correct locations, and that there's more comments. I want to understand the code better as soon as possible, but in order for that to happen there needs to be more comments.

The other thing is ensuring that all files have proper line-endings, since that's much harder to catch once the files are in the tree.

Comment 75

9 years ago
+    static nsHtml5AttributeName* D;
+  D = new nsHtml5AttributeName(ALL_NO_NS, SAME_LOCAL(nsHtml5Atoms::d), ALL_NO_PREFIX);
+  delete D;
+  A = new nsHtml5ElementName(nsHtml5Atoms::a, nsHtml5Atoms::a, NS_HTML5TREE_BUILDER_A, PR_FALSE, PR_FALSE, PR_FALSE);
...

Can something please be done about the huge numbers of non-static non-const arrays and data being created in this patch?  These certainly aren't going to make malloc fall over, but looking at the code I can't understand how anyone would ever r+ it.


+PRUnichar nsHtml5NamedCharacters::NAME_0[] = {
+  'A', 'E', 'l', 'i', 'g'
+};

static const to make these read-only in the shared libraries?

+  NAMES[0] = J_ARRAY_STATIC(PRUnichar, PRInt32, NAME_0);

the jArray stuff makes me sad.  Can we just declare a big static const array with all this stuff in it?
(Assignee)

Comment 76

9 years ago
(In reply to comment #75)
> +    static nsHtml5AttributeName* D;
> +  D = new nsHtml5AttributeName(ALL_NO_NS, SAME_LOCAL(nsHtml5Atoms::d),
> ALL_NO_PREFIX);
> +  delete D;
> +  A = new nsHtml5ElementName(nsHtml5Atoms::a, nsHtml5Atoms::a,
> NS_HTML5TREE_BUILDER_A, PR_FALSE, PR_FALSE, PR_FALSE);
> ...
> 
> Can something please be done about the huge numbers of non-static non-const
> arrays and data being created in this patch?  These certainly aren't going to
> make malloc fall over, but looking at the code I can't understand how anyone
> would ever r+ it.

I wanted to put those into the data segment of the library first. Unfortunately, "static" atoms aren't properly static const. If "static" atoms were changed into having their addresses frozen at link time, I think it would make sense to revisit the pseudo-static approach with pre-interned element and attribute names.

> +PRUnichar nsHtml5NamedCharacters::NAME_0[] = {
> +  'A', 'E', 'l', 'i', 'g'
> +};
> 
> static const to make these read-only in the shared libraries?
> 
> +  NAMES[0] = J_ARRAY_STATIC(PRUnichar, PRInt32, NAME_0);

This is already fixed in the repo for stuff other than the named character tables (which are generated by a different generator code path that I haven't fixed yet). I wanted to address pending review comments more thoroughly before rediffing the patch.

> the jArray stuff makes me sad.  Can we just declare a big static const array
> with all this stuff in it?

Gecko defines string constants all over without rolling them all into one big array. If the backing memory of the arrays is static const, what would be the benefit of rolling all the backing arrays into one big one?
(Assignee)

Updated

9 years ago
Blocks: 500612
(Assignee)

Comment 77

9 years ago
(In reply to comment #69)
> > > > Can this really happen during a document.write?? Or rather, should it
> > > > really be honored then?
> > > 
> > > I don't see why it wouldn't be honored per spec. Opera honors it. IE8,
> > > Gecko's old parser and WebKit don't. Maybe the spec should change here?
> > 
> > For what it's worth, we have code in the tree specifically to not honor this.
> > I haven't done the CVS archeology yet to figure out why, though.
> 
> The case where it mostly seems weird to me to honor <meta>s in document.write
> is when someone is document.writing a whole document. In that case <meta>
> charsets makes no sense at all.

That's different. When the document has been created with document.open(), the encoding is "confident" to begin with and <meta charset> can't change it--document.written or not.

> Given that most implementations are consistent here, and we do it on purpose,
> I'd say the spec should change yes.

What about the case where part of the meta is document.written but the token ends in the network stream (the last '>') is in the network stream? Given how the charset switch is now written into the spec (and implemented), it would be annoying to have to taint the tag token as document.written in that case. (It's easier to check if the buffer that contains the closing '>' has been document.written.)

I made the parser ignore the charset meta if the '>' falls into synchronous document.write(). Filed bug 500611 about the rest.

(In reply to comment #70)
> >+// nsIContentSink
> >+nsresult
> >+nsHtml5Parser::WriteSniffingBufferAndCurrentSegment(const PRUint8* aFromSegment, // can be null
> >+                                                    PRUint32 aCount,
> >+                                                    PRUint32* aWriteCount)
> >+{
> >+  nsresult rv = NS_OK;
> >+  if (mSniffingBuffer) {
> >+    PRUint32 writeCount;
> >+    rv = WriteStreamBytes(mSniffingBuffer, mSniffingLength, &writeCount);
> >+    NS_ENSURE_SUCCESS(rv, rv);
> >+    delete[] mSniffingBuffer;
> >+    mSniffingBuffer = nsnull;
> >+  }
> >+  if (mMetaScanner) {
> >+    delete mMetaScanner;
> >+    mMetaScanner = nsnull;
> >+  }
> 
> mSniffingBuffer and mMetaScanner looks like they can be nsAuto(Array)Ptrs too.

Made them into such. Also made mTokenizer and mTreeBuilder into nsAutoPtrs. Trying to put the classes inside one another became annoying with circular dependencies in inlined methods. And there was a risk of having to undo some of it in the off-the-main-thread world. (Also, in my "overkill" comment earlier, I confused nsAutoPtrs and nsRefPtrs.)

> >+nsHtml5Parser::SniffStreamBytes(const PRUint8* aFromSegment,
> >+                                PRUint32 aCount,
> >+                                PRUint32* aWriteCount)
> >+{
> >+  nsresult rv = NS_OK;
> >+  PRUint32 writeCount;
> >+  for (PRUint32 i = 0; i < aCount; i++) {
> >+    switch (mBomState) {
> ...
> >+  }
> 
> I'm pretty sure we have BOM detecting code elsewhere in our code, could you
> perhaps reuse that? Not a huge deal I guess.

It seems complicated and invasive to plug external code here, considering that the BOM sniffing is incremental and succeeds with minimal buffering here.

> >+void
> >+nsHtml5Parser::ParseUntilSuspend()
> >+{
> >+  NS_PRECONDITION((!mNeedsCharsetSwitch), "ParseUntilSuspend called when charset switch needed.");
> >+  if (mBlocked) {
> >+    return;
> >+  }
> >+  switch (mLifeCycle) {
> >+    case TERMINATED:
> >+      return;
> >+    case NOT_STARTED:
> >+      mTreeBuilder->setScriptingEnabled(IsScriptEnabled(mDocument, mDocShell));
> >+      mTokenizer->start();
> >+      mLifeCycle = PARSING;
> >+      mParser = this;
> >+      break;
> >+    default:
> >+      break;
> >+  }
> >+  
> >+  WillResumeImpl();
> >+  WillParseImpl();
> >+  
> >+  mSuspending = PR_FALSE;
> >+  for (;;) {
> >+    if (!mFirstBuffer->hasMore()) {
> >+      if (mFirstBuffer == mLastBuffer) {
> >+        switch (mLifeCycle) {
> >+          case PARSING:
> >+            // never release the last buffer. instead just zero its indeces for refill
> >+            mFirstBuffer->setStart(0);
> >+            mFirstBuffer->setEnd(0);
> >+            return; // no more data for now but expecting more
> >+          case STREAM_ENDING:
> >+            if (ReadyToCallDidBuildModelImpl(PR_FALSE)) {
> >+              DidBuildModel();            
> >+            } else {
> >+              MaybePostContinueEvent();
> >+            }
> 
> Won't this cause us to post continue events over and over (consuming 100% CPU)
> until ReadyToCallDidBuildModelImpl returns true (i.e. until all scripts have
> finished loading).

It most likely does.

> IIRC the current code instead continues parsing once the last script is doing
> executing. See
> http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsContentSink.cpp#399

I removed the else branch.

> Testcases were submitted in bug 461555.

The general premise of the bug seems wrong per HTML5 spec and the behavior of IE. (Per spec, document.write from bad places blows away the document. IE also blows away the document.)

> >+    if (mFirstBuffer->hasMore()) {
> >+      mLastWasCR = mTokenizer->tokenizeBuffer(mFirstBuffer);
> >+      if (mScriptElement) {
> >+        ExecuteScript();
> >+      }
> >+      if (mNeedsCharsetSwitch) {
> >+        if (PerformCharsetSwitch() == NS_ERROR_HTMLPARSER_STOPPARSING) {
> >+          return;        
> >+        } else {
> >+          // let's continue if we failed to restart
> >+          mNeedsCharsetSwitch = PR_FALSE;
> >+        }
> >+      }
> 
> Can both mScriptElement and mNeedsCharsetSwitch be set? If not, might be worth
> asserting that.

They can't. Added assert.

> Also, should you perhaps do |continue;| after calling ExecuteScript()? Feels
> like all sorts of state can be caused by the script executing.

It's important that the code runs through if (mBlocked) afterwards. It'll hit continue a bit later if neither mBlocked nor mSuspend in true.

> >+void
> >+nsHtml5Parser::ExecuteScript()
> >+{
> >+  NS_PRECONDITION(mScriptElement, "Trying to run a script without having one!");
> >+#ifdef DEBUG
> >+  if (!mScriptsExecuting) {
> >+    NS_ASSERTION(!mSnapshot, "Already had a state snapshot");
> >+    mSnapshot = mTreeBuilder->newSnapshot();
> >+  }
> >+#endif
> 
> Might be worth using #ifdef GATHER_DOCWRITE_STATISTICS or something like that
> instead. So it's clear what the code is for.

Done.

> >+  nsCOMPtr<nsIScriptElement> sele = do_QueryInterface(mScriptElement);
> >+ 
> >+   // Notify our document that we're loading this script.
> >+  nsCOMPtr<nsIHTMLDocument> htmlDocument = do_QueryInterface(mDocument);
> >+  NS_ASSERTION(htmlDocument, "Document didn't QI into HTML document.");
> >+  htmlDocument->ScriptLoading(sele);
> >+ 
> >+   // Copied from nsXMLContentSink
> >+  
> >+  // Now tell the script that it's ready to go. This may execute the script
> >+  // or return NS_ERROR_HTMLPARSER_BLOCK. Or neither if the script doesn't
> >+  // need executing.
> >+  nsresult rv = mScriptElement->DoneAddingChildren(PR_TRUE);
> 
> So per current HTML spec this doesn't do exactly the right thing. If a parent
> of the script element has been removed from the document, the script won't
> execute here. However the HTML spec says it should. There was a recent thread
> about this on the HTML list.

It seems hard to fix this and to deal with bitrot of this patch at the same time when the old and new parser share script execution code. I suggest landing this first and then fixing script execution HTML5 correctness as a follow-up bug. Filed bug 500612.

> >+nsHtml5Parser::MaybePostContinueEvent()
> >+{
> >+  NS_PRECONDITION((mLifeCycle != TERMINATED), "Tried to post continue event when the parser is done.");
> >+  if (mContinueEvent) {
> >+    return; // we already have a pending event
> >+  }
> >+#if 0
> >+  if ((!mTreeBuilder->NeedsFlush()) && (mStreamListenerState == eOnStart || mStreamListenerState == eOnDataAvail)) {
> >+    return; // we are expecting stream events and we don't need a quick flush for the tree builder
> >+  }
> >+#endif
> 
> Why the #if 0?

Should have been removed before. Removed now.

> >+  // This creates a reference cycle between this and the event that is
> >+  // broken when the event fires.
> >+  nsCOMPtr<nsIRunnable> event = new nsHtml5ParserContinueEvent(this);
> >+  if (NS_FAILED(NS_DispatchToCurrentThread(event))) {
> >+      NS_WARNING("failed to dispatch parser continuation event");
> >+  } else {
> >+      mContinueEvent = event;
> >+  }
> 
> Probably worth keeping the event object around and reuse it multiple times.

Deferred as bug 500615 per IRC discussion.

> >+void
> >+nsHtml5Parser::ScriptExecuting()
> >+{
> >+  ++mScriptsExecuting;
> >+}
> >+
> >+void
> >+nsHtml5Parser::ScriptDidExecute()
> >+{
> >+  NS_ASSERTION(mScriptsExecuting > 0, "Too many calls to ScriptDidExecute");
> >+  --mScriptsExecuting;
> >+}
> 
> I think this API was removed from trunk.

I think this has come and gone during the development of this patch. :-( Fixed.

> >+PRBool
> >+nsHtml5Parser::CanInterrupt()
> >+{
> >+  return !mFragmentMode;
> >+}
> 
> This should also return false during document.writes, no?

In principle, yes. Fixed.

However, it seems that it might as well return true always, because the parser won't honor the interrupt anyway during doc.write or in the fragment case. Not returning true seems to save a break of out the tokenizer loop, though.

> >diff --git a/content/html/parser/src/nsHtml5Parser.h b/content/html/parser/src/nsHtml5Parser.h
> >+enum eHtml5ParserLifecycle {
> >+  NOT_STARTED = 0,
> >+  PARSING = 1,
> >+  STREAM_ENDING = 2,
> >+  TERMINATED = 3
> >+};
> 
> Please document these

Documented.

> >+enum eBomState {
> >+  BOM_SNIFFING_NOT_STARTED = 0,
> >+  SEEN_UTF_16_LE_FIRST_BYTE = 1,
> >+  SEEN_UTF_16_BE_FIRST_BYTE = 2,
> >+  SEEN_UTF_8_FIRST_BYTE = 3,
> >+  SEEN_UTF_8_SECOND_BYTE = 4,
> >+  BOM_SNIFFING_OVER = 5
> >+};
> 
> And these (though these are more self evident)

Documented.

> >+class nsHtml5Parser : public nsIParser,
> 
> Documentation in general is pretty random/non-existant here. For functions from
> nsIParser/nsIContentSink no comment is needed since comments should be existing
> in nsIParser.h/nsIContentSink.h. Though it's great to just mentions which ones
> have a useless/no-op implementation. And no need to carry forward useless
> @update comments.
> 
> For functions that are new, please write complete documentation about what the
> function does and what the various arguments are.

Documentation in progress.

> >+    /**
> >+     * Tells the parser that a script is now executing. The only data we
> >+     * should resume parsing for is document.written data. We'll deal with any
> >+     * data that comes in over the network later.
> >+     */
> >+    virtual void ScriptExecuting();
> >+
> >+    /**
> >+     * Tells the parser that the script is done executing. We should now
> >+     * continue the regular parsing process.
> >+     */
> >+    virtual void ScriptDidExecute();
> 
> These two functions are no longer on nsIParser

Removed.

> >+#endif // nsHtml5PendingNotification_h__
> >\ No newline at end of file
> 
> Please add newline. I think this happens at other places in the patch.

Made thorough whitespace readjustmets.

(In reply to comment #71)
> (In reply to comment #63)
> > No, there may still be unparsed data in the buffer chain if the caller did:
> > document.write("<script></script>trailing data");
> > document.close();
> > 
> > Even without trailing data, posting a continue event ensures that onload is
> > fired from the event loop rather than from within a script.
> 
> This is a drive-by comment, and may not apply... Just posting an event may not
> be enough here since modal dialogs etc process events, so a nested
> document.write() that causes a continue event to be posted and then does
> alert("foo") could process that event from within the script.

I hadn't thought of that case. However, if it is a problem, the old parser most likely has the problem, too. I filed bug 500616 just in case.

(In reply to comment #72)
> >diff --git a/content/html/parser/src/nsHtml5Portability.cpp b/content/html/parser/src/nsHtml5Portability.cpp
> 
> >+nsHtml5Portability::newLocalNameFromBuffer(PRUnichar* buf, PRInt32 offset, PRInt32 length)
> >+{
> >+  // Optimization opportunity: make buf itself null-terminated
> >+  PRUnichar* nullTerminated = new PRUnichar[length + 1];
> >+  memcpy(nullTerminated,buf, length * sizeof(PRUnichar));
> >+  nullTerminated[length] = 0;
> >+  nsIAtom* rv = NS_NewAtom(nullTerminated);
> >+  delete[] nullTerminated;
> >+  return rv;
> >+}
> 
> Are you forgetting to use the offset argument?

The offset should always be 0 here; filed bug 500617.

> And a faster/simpler implementation is simply:
> 
> return NS_NewAtom(nsDependentSubstring(buf, length))

Fixed.

> >+jArray<PRUnichar,PRInt32>
> >+nsHtml5Portability::newCharArrayFromLocal(nsIAtom* local)
> >+{
> >+  nsAutoString temp;
> >+  local->ToString(temp);
> >+  PRInt32 len = temp.Length();
> >+  jArray<PRUnichar,PRInt32> rv = jArray<PRUnichar,PRInt32>(len);
> >+  memcpy(rv, temp.BeginReading(), len * sizeof(PRUnichar));
> >+  return rv;  
> >+}
> 
> This copies twice, should be doable to just copy once if you use
> CopyUTF16toUTF8 and nsIAtom::getUTF8String

How would I get the required UTF-16 length in that case?

> >+PRBool
> >+nsHtml5Portability::localEqualsBuffer(nsIAtom* local, PRUnichar* buf, PRInt32 offset, PRInt32 length)
> >+{
> >+  nsAutoString temp = nsAutoString(buf + offset, length);
> >+  return local->Equals(temp);
> >+}
> 
> local->Equals(nsDependentString(buf + offset, length));

Fixed.

> >+PRBool
> >+nsHtml5Portability::lowerCaseLiteralIsPrefixOfIgnoreAsciiCaseString(const char* lowerCaseLiteral, nsString* string)
> 
> lowerCaseLiteralIsPrefixOfStringIgnoreAsciiCase?

At least it says what it does. :-)

> Please assert that lowerCaseLiteral is indeed lower case.

Fixed.

> >+  if (!string) {
> >+    return PR_FALSE;
> >+  }
> >+  const char* litPtr = lowerCaseLiteral;
> >+  const PRUnichar* strPtr = string->BeginReading();
> >+  const PRUnichar* end = string->EndReading();
> >+  PRUnichar litChar;
> >+  while ((litChar = (PRUnichar)*litPtr) && (strPtr < end)) {
> 
> Do you need the explicit cast here? It makes things much harder to read.

Removed.

> >+    PRUnichar strChar = *strPtr;
> >+    if (strChar >= 'A' && strChar <= 'Z') {
> >+      strChar += 0x20;
> >+    }
> >+    if (litChar != strChar) {
> >+      return PR_FALSE;
> >+    }
> >+    ++litPtr;
> >+    ++strPtr;
> >+  }
> >+  return PR_TRUE;
> >+}
> 
> Won't this return true if string is a prefix of lowerCaseLiteral too? That
> seems surprising given the function name. I think you want to return strPtr !=
> end at the end.

Fixed but in a way that make a string a prefix of itself.

> >+jArray<PRUnichar,PRInt32>
> >+nsHtml5Portability::isIndexPrompt()
> >+{
> >+  // Yeah, this whole method is uncool
> >+  char* literal = "This is a searchable index. Insert your search keywords here: ";
> 
> This should be translatable.

Filed bug 500631.

(In reply to comment #73)
> >diff --git a/content/html/parser/src/nsHtml5TreeBuilderCppSupplement.h b/content/html/parser/src/nsHtml5TreeBuilderCppSupplement.h
> >+nsHtml5TreeBuilder::shallowClone(nsIContent* aElement)
> >+{
> >+  nsINode* clone;
> >+  aElement->Clone(aElement->NodeInfo(), &clone);
> 
> ASSERT that clone->IsNodeOfType(nsINode::eCONTENT) returns true.

Fixed. (This whole method is going away in due course.)

> >+nsHtml5TreeBuilder::addAttributesToElement(nsIContent* aElement, nsHtml5HtmlAttributes* aAttributes)
> >+{
> >+  nsIContent* holder = createElement(kNameSpaceID_XHTML, nsHtml5Atoms::div, aAttributes);
> >+  nsHtml5TreeOperation* treeOp = mOpQueue.AppendElement();
> >+  // XXX if null, OOM!
> >+  treeOp->Init(eTreeOpAddAttributes, holder, aElement);
> 
> Hum, this is very suboptimal. And we can't really do it off the main thread
> either. Couldn't you simply set the attributes on the real element for now? I
> guess I'm fine either way for now.

I left this as is, because this will have to change in the off-the-main-thread world anyway.

> >+nsHtml5TreeBuilder::elementPopped(PRInt32 aNamespace, nsIAtom* aName, nsIContent* aElement)
> >+{
> >+  NS_ASSERTION((aNamespace == kNameSpaceID_XHTML || aNamespace == kNameSpaceID_SVG || aNamespace == kNameSpaceID_MathML), "Element isn't HTML, SVG or MathML!");
> >+  NS_ASSERTION(aName, "Element doesn't have local name!");
> >+  NS_ASSERTION(aElement, "No element!");
> >+
> >+  MaybeSuspend();  
> >+  
> >+  if (aNamespace == kNameSpaceID_MathML) {
> >+    return;
> >+  }  
> >+  // we now have only SVG and HTML
> >+  
> >+  if (aName == nsHtml5Atoms::script) {
> >+//    mConstrainSize = PR_TRUE; // XXX what is this?
> 
> mConstrainSize makes us not split text nodes into 4k chunks. Functionality-wise
> not needed since the scriptelement will join all textnodes if there's multiple.
> However it's unneccesary perf-wise to split the nodes up.

Removed comment.

> >+  if (aNamespace == kNameSpaceID_SVG) {
> >+#if 0
> >+    if (aElement->HasAttr(kNameSpaceID_None, nsHtml5Atoms::onload)) {
> >+      Flush();
> >+
> >+      nsEvent event(PR_TRUE, NS_SVG_LOAD);
> >+      event.eventStructType = NS_SVG_EVENT;
> >+      event.flags |= NS_EVENT_FLAG_CANT_BUBBLE;
> >+
> >+      // Do we care about forcing presshell creation if it hasn't happened yet?
> >+      // That is, should this code flush or something?  Does it really matter?
> >+      // For that matter, do we really want to try getting the prescontext?  Does
> >+      // this event ever want one?
> >+      nsRefPtr<nsPresContext> ctx;
> >+      nsCOMPtr<nsIPresShell> shell = mParser->GetDocument()->GetPrimaryShell();
> >+      if (shell) {
> >+        ctx = shell->GetPresContext();
> >+      }
> >+      nsEventDispatcher::Dispatch(aElement, ctx, &event);
> >+    }
> >+#endif
> 
> Need to fix this soon so we can get testing on SVG-in-HTML, though fine with
> initial landing without it.

I thought the plan was to try to change this on the XML side as well. If the event really needs to fire in sync, it'll have to break out of the tokenizer loop here and inhibit document.write.

> >diff --git a/content/html/parser/src/nsHtml5TreeOperation.cpp b/content/html/parser/src/nsHtml5TreeOperation.cpp
> >+nsHtml5TreeOperation::Perform(nsHtml5TreeBuilder* aBuilder)
> >+    case eTreeOpAppendChildrenToNewParent: {
> >+      aBuilder->FlushPendingAppendNotifications();
> >+      PRUint32 childCount = mParent->GetChildCount();
> >+      PRBool didAppend = PR_FALSE;
> >+      while (mNode->GetChildCount()) {
> >+        nsCOMPtr<nsIContent> child = mNode->GetChildAt(0);
> >+        rv = mNode->RemoveChildAt(0, PR_TRUE);
> 
> Shouldn't you be adding |, PR_FALSE| here?

Yes. Fixed.

I'll add more documentation before attaching a new patch.
(Assignee)

Comment 78

9 years ago
(In reply to comment #75)
> +PRUnichar nsHtml5NamedCharacters::NAME_0[] = {
> +  'A', 'E', 'l', 'i', 'g'
> +};
> 
> static const to make these read-only in the shared libraries?

Made these static const.
(Assignee)

Comment 79

9 years ago
Created attachment 385389 [details] [diff] [review]
Patch with .java files omitted
Attachment #384830 - Attachment is obsolete: true
Attachment #385389 - Flags: superreview?(jonas)
Attachment #385389 - Flags: review?(mrbkap)
Attachment #384830 - Flags: superreview?(jonas)
Attachment #384830 - Flags: review?(mrbkap)
(Assignee)

Comment 80

9 years ago
Created attachment 385393 [details] [diff] [review]
Interdiff form 2009-06-24 to 2009-06-26

Pushed to try at rev bb03b52176c3.

sicking, over to you. I have no unpushed local changes.
Attachment #384864 - Attachment is obsolete: true
(Assignee)

Comment 81

9 years ago
Pushed as rev 168dd0eb7283. Woohoo! Thank you for the reviews and for all the other help!
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(Assignee)

Comment 82

9 years ago
Created attachment 385640 [details] [diff] [review]
Contingency patch for making the HTML5 run no code

Since it's impractical to back out a push of this size if things go wrong on the tinderboxen, I'm attaching a patch that keeps the parser in the build but makes it run no code (not even its static initialization which is otherwise run even when the parser is preffed off).

Comment 83

9 years ago
"SEC" is defined as 1 in /usr/include/sys/time.h on Solaris.

Can we rename that?

Error message:
nsHtml5ElementName.h", line 153: Error: Identifier expected instead of "1".
(Assignee)

Comment 84

9 years ago
Yes, it will be renamed.
(Assignee)

Comment 85

9 years ago
The Solaris issue is now bug 501091.
(Assignee)

Comment 86

9 years ago
(In reply to comment #81)
> Pushed as rev 168dd0eb7283.

For the record, that rev shows the merge in the other direction which is illogical. The relanding was:
http://hg.mozilla.org/mozilla-central/rev/8b5103cb12a6
http://hg.mozilla.org/mozilla-central/rev/1811499852cc
http://hg.mozilla.org/mozilla-central/rev/46448d830d4a
Is it useful to file bugs on the new parser yet? When turning the pref on, I seem to crash on virtually every page.
Yes, please do file bugs asap. The reason we wanted it landed was to get testing.
Depends on: 501113
Attachment #385389 - Flags: superreview?(jonas)
Attachment #385389 - Flags: review?(mrbkap)
Alias: html5-parsing-land
It turns out that the fix for bug 501113 has fixed these crashes.
Depends on: 502091
Depends on: 502103
(Assignee)

Updated

9 years ago
No longer depends on: 502103
I've been tasked with devising/proposing a plan to incorporate the java source files into mozilla-central.  As a start, I've posted a patch that adds some new export scripts to the validator.nu htmlparser repository:

http://bugzilla.validator.nu/show_bug.cgi?id=592

The high bit is that we need to include, at minimum, some sort of standalone version of the Java-to-C++ translator so that mozilla developers can hack on the java sources and easily regenerate the corresponding C++ files.

Henri, please let me know if there's anything in the patch you don't like, and feel free to modify it as you see fit.  After all, it is your repo.

Updated

9 years ago
Depends on: 502984
(Assignee)

Updated

9 years ago
No longer depends on: 502984
Created attachment 387588 [details] [diff] [review]
Patch to create Makefiles with sync/libs/translate/clean targets

I had a realization today.  All the HTML5 parsing-related java files live in Subversion repositories, and one nice thing about Subversion is that it allows you to check out an arbitrary subdirectory of a respository.  So, instead of exporting the java source files from some external svn working directory into the mozilla-central source tree, it makes a lot more sense to pull just the necessary directories directly from the appropriate repository.

This patch by itself does not contain the java files (it would have been almost 2MB if it did), but if you read parser/html/java/README.txt you'll see that it's easy to fetch them (make sync), build the translator (make), and perform the translation (make translate).  The 'sync' target fetches not only the necessary java sources but also the appropriate license files.

This does introduce a Subversion dependency for those who want to hack on the HTML5 parser, but only for them.  Building mozilla-central is still possible without Subversion, since these Makefiles are independent of the existing build system.  At some point we may decide to eliminate the Subversion dependency by committing the java sources to mozilla-central, but I'll leave that for a future code review.
Attachment #387588 - Flags: superreview?(jst)
Attachment #387588 - Flags: review?(jst)
Attachment #387588 - Flags: review?(jst) → review?(hsivonen)
Doesn't this mean that if that subversion repo were to go away, we'd be violating the GPL and LGPL?
Comment on attachment 387588 [details] [diff] [review]
Patch to create Makefiles with sync/libs/translate/clean targets

>+# The Initial Developer of the Original Code is
>+#   Ben Newman <b{enjam,newma}n@mozilla.com>
>+#
>+# Portions created by the Initial Developer are Copyright (C) 2009
>+# the Initial Developer. All Rights Reserved.
>+#
>+# Contributor(s):

No, the initial developer of the original code is Mozilla Corporation (or Mozilla Foundation), not you. You should list yourself under Contributor(s).
(Assignee)

Comment 94

9 years ago
(In reply to comment #91)
> I had a realization today.  All the HTML5 parsing-related java files live in
> Subversion repositories, and one nice thing about Subversion is that it allows
> you to check out an arbitrary subdirectory of a respository.  So, instead of
> exporting the java source files from some external svn working directory into
> the mozilla-central source tree, it makes a lot more sense to pull just the
> necessary directories directly from the appropriate repository.

I'd prefer to get rid of Subversion. The reason I haven't properly pursued getting rid of Subversion, yet, is that I've focused on working on the parser.

Subversion is a pain with linear history, the committer policy issues of a central repo and mysterious error messages when Debian packages of svn binaries are horked. I'd like to migrate the repo from svn to hg to make some of the pain go away. Specifically, I'd like to make it possible for people who want to port the parser to non-Gecko C++ environments to work in a non-linear fashion and without having to set up commit access to the CVSDude repo from get-go. (There's interest in porting it to C++ STL as opposed to NSPR and porting in to CPython and C Ruby library back ends.)

(I don't think it makes sense to make m-c the primary hg repo hosting the .java files for non-Gecko uses, because m-c is huge and comes with its own policy issues for people who want to work on STL, Python or Ruby targets. I don't know if there is a way to snapshot files from another hg repo to m-c without losing history.)

(In reply to comment #92)
> Doesn't this mean that if that subversion repo were to go away, we'd be
> violating the GPL and LGPL?

As I understand it, Mozilla wouldn't be if it packaged binaries under the MPL, but GPL-using downstreams would be. The Java files corresponding to the generated C++ files should be in m-c to help GPL downstreams comply. (However, they should probably move near the current directory location of the C++ files--the .java files are still under content/html/parser/javasrc/.)

If the translator binary and javaparser.jar go into m-c, the translator source and the javaparser source should go in, too, in order to comply with LGPL for javaparser.jar and the translator bits derived from javaparser.jar sources. However, per licensing at mozilla.org, the binary-only Google Code Ant plug-in should be zapped from the javaparser source drop first.

That said, I'd very much prefer the m-c copies of any .java files being treated as (L)GPL compliance devices and as aids for urgent changes within m-c without external deps. I think it would be good not to fork the .java files for further and to instead work from the parser repo and snapshotting to m-c if at all possible.

The setup I've had is that I work in a sandbox checked out from SVN and when I run the translator, the translator writes into my m-c check-out sandbox and also snapshots the relevant .java files there (but not the translator itself currently).

There's bug 499141 for landing the translator. I suggest we continue there.

(I concur with reed on the license header point.)
If the Java and the derived C are both under the BSD licence, we have no problem with GPL issues. We are offering them to other people under those licences. If other people take them and bind them into a GPL app, they then have to make sure they ship the preferred form of modification to their customers, but that's not our problem.

However, we should make the life of those people as easy as possible, which Henri seems to be trying to do. Although it would be better if the Java sources were also in our repo.

Gerv
(In reply to comment #94)
> (I don't think it makes sense to make m-c the primary hg repo hosting the .java
> files for non-Gecko uses, because m-c is huge and comes with its own policy
> issues for people who want to work on STL, Python or Ruby targets.
OT, but would http://projectkenai.com/ be a useful choice for off m-c porject hosting?

(In reply to comment #94)
> I don't know if there is a way to snapshot files from another hg repo to m-c without losing history.)
Forests/fsnap
(In reply to comment #93)
> No, the initial developer of the original code is Mozilla Corporation (or
> Mozilla Foundation), not you. You should list yourself under Contributor(s).

In that case, someone who knows about these things should update this page:
https://developer.mozilla.org/En/Makefile.in
(In reply to comment #92)
> Doesn't this mean that if that subversion repo were to go away, we'd be
> violating the GPL and LGPL?

Even if that's the case, it shouldn't be a problem as long as we check the code into m-c before the svn repo goes away, which I have no doubt is going to happen (soon).  I held off including the java files mostly in order to keep this patch reviewable (not to mention bugzilla-attachable).  There will definitely be a follow-up commit that adds the java files.

(In reply to comment #96)
> (In reply to comment #94)
> > I don't know if there is a way to snapshot files from another hg repo to m-c without losing history.)
> Forests/fsnap

fwiw: http://mercurial.selenic.com/wiki/NestedRepositories
> In that case, someone who knows about these things should update this page:

Done.
(In reply to comment #98)
> it shouldn't be a problem as long as we check the code
> into m-c before the svn repo goes away, which I have no doubt is going to
> happen (soon).

The checking in, not the going away!  Note to self, stay away from which.
Comment on attachment 387588 [details] [diff] [review]
Patch to create Makefiles with sync/libs/translate/clean targets

Moved my translator check-in patch to a more appropriate bug:
https://bugzilla.mozilla.org/attachment.cgi?id=388568&action=edit
Attachment #387588 - Attachment is obsolete: true
Attachment #387588 - Flags: superreview?(jst)
Attachment #387588 - Flags: review?(hsivonen)
Depends on: 507119

Updated

9 years ago
Depends on: 507872

Updated

9 years ago
Depends on: 512310
status1.9.2: --- → beta1-fixed
Flags: wanted1.9.2?
Flags: wanted1.9.2-
Flags: blocking1.9.2?
Flags: blocking1.9.2-
Depends on: 520421
Depends on: 519726
No longer depends on: 520421
Depends on: 521345
No longer blocks: 568516

Updated

8 years ago
Depends on: 601425
You need to log in before you can comment on or make changes to this bug.