Last Comment Bug 650784 - Implement HTML to plain text conversion as a DOM walker without nsParser dependency
: Implement HTML to plain text conversion as a DOM walker without nsParser depe...
Status: RESOLVED FIXED
: addon-compat, dev-doc-needed
Product: Core
Classification: Components
Component: Serializers (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla13
Assigned To: Henri Sivonen (:hsivonen)
:
Mentors:
Depends on: 733935 721313 731896
Blocks: 282244 650775 720623 650787 729044 731668 732070
  Show dependency treegraph
 
Reported: 2011-04-18 07:10 PDT by Henri Sivonen (:hsivonen)
Modified: 2014-11-10 05:36 PST (History)
13 users (show)
bugzillamozillaorg_serge_20140323: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
WIP (24.13 KB, patch)
2012-01-20 06:39 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Part 1: Remove the dependency on nsParser (32.21 KB, patch)
2012-01-23 08:08 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Part 2: Remove dependency on nsIParserNode and nsIParserService (45.62 KB, patch)
2012-01-23 08:09 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Part 1: Remove the dependency on nsParser, v2 (39.89 KB, patch)
2012-01-25 05:08 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Part 2: Remove dependency on nsIParserNode and nsIParserService, v2 (47.49 KB, patch)
2012-01-25 05:28 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Part 1: Introduce the new API (26.95 KB, patch)
2012-01-26 01:31 PST, Henri Sivonen (:hsivonen)
bugs: review-
Details | Diff | Review
Part 2: Make nsPlainText serializer not be a content sink (12.36 KB, patch)
2012-01-26 01:36 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 3: Remove dependency on nsIParserNode and nsIParserService, v3 (47.49 KB, patch)
2012-01-26 01:38 PST, Henri Sivonen (:hsivonen)
bugs: review-
Details | Diff | Review
Part 1.5: Move nsScriptableUnescapeHTML from Toolkit to Core [Checked in: Comment 47] (20.40 KB, patch)
2012-01-26 04:44 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 0: Fold old nsParserUtils into nsContentUtils (46.33 KB, patch)
2012-02-20 06:36 PST, Henri Sivonen (:hsivonen)
no flags Details | Diff | Review
Part 0: Fold old nsParserUtils into nsContentUtils, v2 [Checked in: Comment 47] (45.44 KB, patch)
2012-02-20 23:51 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 4: Scripted test case [Checked in: Comment 47] (4.11 KB, patch)
2012-02-21 01:38 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 3: Remove dependency on nsIParserNode and nsIParserService, v4 [Checked in: Comment 54] (47.43 KB, patch)
2012-02-21 01:44 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 1: Introduce the new API, v2 [Checked in: Comment 47] (25.23 KB, patch)
2012-02-21 01:45 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 0.5: Let the build system know that parser/html/ can have .idl files [Checked in: Comment 47] (3.79 KB, patch)
2012-02-21 22:49 PST, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review
Part 2: Make nsPlainText serializer not be a content sink, unrotted [Checked in: Comment 54] (12.12 KB, patch)
2012-03-07 02:30 PST, Henri Sivonen (:hsivonen)
hsivonen: review+
Details | Diff | Review

Description Henri Sivonen (:hsivonen) 2011-04-18 07:10:32 PDT
It appears that currently clipboard-related conversions from HTML to plain text use http://mxr.mozilla.org/mozilla-central/source/widget/src/xpwidgets/nsHTMLFormatConverter.cpp#276 which uses the old parser and relies on nsPlainTextSerializer being an nsIHTMLContentSink.

The old parser and the old content sink interfaces are going to go away. These conversions should use code that uses the new HTML parser to go from HTML source into a DOM if necessary and then serialization code that takes a DOM (sub)tree and outputs plain text.
Comment 1 Ben Bucksch (:BenB) 2011-09-27 01:36:49 PDT
Please note the warnings at bug 650787 comment 1, 10 and 11.
Comment 2 Henri Sivonen (:hsivonen) 2012-01-20 06:39:53 PST
Created attachment 590186 [details] [diff] [review]
WIP

This patch moves conversion to plain text to behind a static method in nsContentUtils instead of the XPCOM stuff we had before. What should I do about the TestHarness.h-based test can't call nsContentUtils?
Comment 3 Ben Bucksch (:BenB) 2012-01-20 06:51:32 PST
Comment on attachment 590186 [details] [diff] [review]
WIP

Henri, the converters are used in many places, also in Thunderbird and various in extensions. They must be exposed by XPCOM. Where similar functionality had not been properly exposed to JavaScript, it posed a huge problem for many parties, I've often heard people desperately needing it but having big problems to get to it from JS. If anything, it must be more accessible from XPCOM, not less.

This is really important functionality that must be available to all components, like spell-checker, gloda, addon-manager wanting to display a HTML snipplet as plaintext, or anything else that needs to convert HTML to plaintext.
Comment 4 Ben Bucksch (:BenB) 2012-01-20 06:53:06 PST
Comment on attachment 590186 [details] [diff] [review]
WIP

As one of the authors of this component, I hope I don't overstep my authority by doing review- on this.
Comment 5 Henri Sivonen (:hsivonen) 2012-01-20 07:08:55 PST
(In reply to Ben Bucksch (:BenB) from comment #3)
> Henri, the converters are used in many places, also in Thunderbird 

I'm planning to change Thunderbird call sites.

> They must be exposed by XPCOM.

XPCOM-exposed content sinks are going away and the old parser that can feed into them is going away. The old way of using this stuff isn't going to stay around.

However, I'd be OK with wrapping the new nsContentUtils method in something that can be called through XPCOM.

Would it work for you if nsScriptableUnescapeHTML (exposed via XPCOM) allowed the caller to pass flags and the column wrap?
Comment 6 Ben Bucksch (:BenB) 2012-01-20 11:47:18 PST
Comment on attachment 590186 [details] [diff] [review]
WIP

Yes, any way for JS callers to pass HTML through it and get plaintext using this class/implementation, and passing flags to it to control it, would be fine with me. Thanks!
Revoking my objection.
Comment 7 Henri Sivonen (:hsivonen) 2012-01-23 08:08:20 PST
Created attachment 590721 [details] [diff] [review]
Part 1: Remove the dependency on nsParser

This patch exposes all functionality via XPCOM.
Comment 8 Henri Sivonen (:hsivonen) 2012-01-23 08:09:26 PST
Created attachment 590722 [details] [diff] [review]
Part 2: Remove dependency on nsIParserNode and nsIParserService
Comment 9 Henri Sivonen (:hsivonen) 2012-01-25 05:08:10 PST
Created attachment 591430 [details] [diff] [review]
Part 1: Remove the dependency on nsParser, v2
Comment 10 Henri Sivonen (:hsivonen) 2012-01-25 05:28:30 PST
Created attachment 591432 [details] [diff] [review]
Part 2: Remove dependency on nsIParserNode and nsIParserService, v2
Comment 11 Henri Sivonen (:hsivonen) 2012-01-26 01:31:00 PST
Created attachment 591719 [details] [diff] [review]
Part 1: Introduce the new API
Comment 12 Henri Sivonen (:hsivonen) 2012-01-26 01:36:02 PST
Created attachment 591720 [details] [diff] [review]
Part 2: Make nsPlainText serializer not be a content sink
Comment 13 Henri Sivonen (:hsivonen) 2012-01-26 01:38:25 PST
Created attachment 591722 [details] [diff] [review]
Part 3: Remove dependency on nsIParserNode and nsIParserService, v3

Some bits in the last part would logically belong in the middle part, but the last two parts will land together anyway.
Comment 14 Henri Sivonen (:hsivonen) 2012-01-26 01:54:09 PST
Dev doc relevance:
This change affects extension and XULRunner documentation. (Not Web platform documentation.)

nsIHTMLToTextSink is removed. nsIScriptableHTMLParser is introduced. (Instantiated with NS_SCRIPTABLEHTMLPARSER_CONTRACTID / "@mozilla.org/scriptablehtmlparser;1".)
Comment 15 Henri Sivonen (:hsivonen) 2012-01-26 03:34:37 PST
Should I move nsScriptableUnescape.h/cpp to some place considered part of Core? I left it where it was in order to avoid breaking history.
Comment 16 Henri Sivonen (:hsivonen) 2012-01-26 04:44:21 PST
Created attachment 591755 [details] [diff] [review]
Part 1.5: Move nsScriptableUnescapeHTML from Toolkit to Core
[Checked in: Comment 47]
Comment 17 Olli Pettay [:smaug] 2012-02-07 03:10:23 PST
You should use the new license.
Comment 18 Olli Pettay [:smaug] 2012-02-07 03:15:40 PST
Comment on attachment 591719 [details] [diff] [review]
Part 1: Introduce the new API

>+[scriptable, uuid(290f49bb-0619-4bda-8006-ab31bec7231a)]
>+interface nsIScriptableHTMLParser : nsISupports
Kind of strange name.
Would nsIHTMLToTextConverter be ok?
It could then have convert() method

Also, the new interface should have some tests.
(perhaps they are in some other patch for this bug)
Comment 19 Olli Pettay [:smaug] 2012-02-07 03:18:29 PST
Comment on attachment 591755 [details] [diff] [review]
Part 1.5: Move nsScriptableUnescapeHTML from Toolkit to Core
[Checked in: Comment 47]

This needs some minor changes if scriptableHTMLParser is renamed
Comment 20 Olli Pettay [:smaug] 2012-02-07 03:24:19 PST
Comment on attachment 591720 [details] [diff] [review]
Part 2: Make nsPlainText serializer not be a content sink

Horrible old code
Comment 21 Olli Pettay [:smaug] 2012-02-07 03:39:28 PST
Comment on attachment 591722 [details] [diff] [review]
Part 3: Remove dependency on nsIParserNode and nsIParserService, v3


>+nsContentUtils::IsHTMLBlock(nsIAtom* aLocalName)
>+{
>+  return (aLocalName == nsGkAtoms::address) ||
>+    (aLocalName == nsGkAtoms::article) ||
Might look nicer if (aLocalName == nsGkAtoms::address) ||
was in its own line


>-      rv = DoAddLeaf(nsnull,
>-                     eHTMLTag_text,
>-                     Substring(textstr, start, length-start));
>+      DoAddText(false, Substring(textstr, start, length-start));
space before and after -


>+nsPlainTextSerializer::DoAddText(bool aIsLineBreak, const nsAString& aText)
...
>+  if ((mTagStackIndex > 1 &&
>+       mTagStack[mTagStackIndex-2] == nsGkAtoms::select) ||
>+      (mTagStackIndex > 0 &&
>+        mTagStack[mTagStackIndex-1] == nsGkAtoms::select)) {
>+    // Don't output the contents of SELECT elements;
>+    // Might be nice, eventually, to output just the selected element.
>+    // Read more in bug 31994.
>+    return;
>+  }
>+
>+  if (mTagStackIndex > 0 &&
>+      (mTagStack[mTagStackIndex-1] == nsGkAtoms::script ||
>+       mTagStack[mTagStackIndex-1] == nsGkAtoms::style)) {
>+    // Don't output the contents of <script> or <style> tags;
>+    return;
>+  }
So this code is now in two places.
Could you add some helper method for it.
Comment 22 Henri Sivonen (:hsivonen) 2012-02-07 04:58:58 PST
(In reply to Olli Pettay [:smaug] from comment #18)
> >+interface nsIScriptableHTMLParser : nsISupports
> Kind of strange name.
> Would nsIHTMLToTextConverter be ok?
> It could then have convert() method

This isn't really a plain text converter interface. It's an interface for exposing entry points to the HTML parser to scripts and to outside libxul. These methods are logically like static methods, but they are on an interface, because that's how XPCOM does stuff.

Considering what's now in the pipeline, bug 650776 will add more stuff to this interface.

Maybe putting "HTML" in the interface name isn't great idea, since folding the functionality of nsIScriptableUnescapeHTML would already introduce an entry point to the XML parser. But, OTOH, it doesn't matter that much if a primarily HTML parser interface has some XML entry points, too.

Are you OK with introducing an XPCOM service with a bunch of methods to be called by code that can't call nsContentUtils directly? I'd rather not add a bunch of one-off interfaces like nsIHTMLToTextConverter now that we don't guarantee frozen vtables anyway. How would you name such a service? nsIParserService is taken and I thought I'd keep new stuff on its way in and legacy stuff on its way out separate.

> Also, the new interface should have some tests.
> (perhaps they are in some other patch for this bug)

The patch changes content/base/test/TestPlainTextSerializer.cpp to call the new interface. The serializer functionality behind the interface hasn't expanded, so retargeting the existing test coverage to the new entry point seemed appropriate.
Comment 23 Olli Pettay [:smaug] 2012-02-07 05:02:54 PST
(In reply to Henri Sivonen (:hsivonen) from comment #22)
> Are you OK with introducing an XPCOM service with a bunch of methods to be
> called by code that can't call nsContentUtils directly?
That sounds good. We even used to have nsIContentUtils.
The name was terrible though.

 I'd rather not add a
> bunch of one-off interfaces like nsIHTMLToTextConverter now that we don't
> guarantee frozen vtables anyway. How would you name such a service?
> nsIParserService is taken and I thought I'd keep new stuff on its way in and
> legacy stuff on its way out separate.
nsIParserUtils? or nsIParsingUtils


> The patch changes content/base/test/TestPlainTextSerializer.cpp to call the
> new interface. The serializer functionality behind the interface hasn't
> expanded, so retargeting the existing test coverage to the new entry point
> seemed appropriate.
Just test that the new things works also in JS (since you're making the interface scriptable.)
Comment 24 Henri Sivonen (:hsivonen) 2012-02-08 06:49:37 PST
(In reply to Olli Pettay [:smaug] from comment #23)
> nsIParserUtils?

We already have nsParserUtils that's pretty much unrelated to parsing. OK if I rename that one to nsAttributeUtils to get it out of the way?
Comment 25 Olli Pettay [:smaug] 2012-02-08 08:06:58 PST
What about merging the current nsParserUtils to nsContentUtils, and then 
use nIParserUtils for this new stuff.
Comment 26 :Ms2ger 2012-02-08 08:14:52 PST
Ahem, can we stop moving this bug into Core::Selection?
Comment 27 Ben Bucksch (:BenB) 2012-02-08 18:30:37 PST
Before you move components around: "Utils" is a pretty bad name anyway. I'd recommend something with "convert" or similar in the name.
Comment 28 Henri Sivonen (:hsivonen) 2012-02-09 00:35:52 PST
(In reply to Ms2ger from comment #26)
> Ahem, can we stop moving this bug into Core::Selection?

That happens due to some interaction between lazy-loading tabs and form state restoration. The changes are accidental and involuntary.

(In reply to Olli Pettay [:smaug] from comment #25)
> What about merging the current nsParserUtils to nsContentUtils, and then 
> use nIParserUtils for this new stuff.

Sold!

(In reply to Ben Bucksch (:BenB) from comment #27)
> Before you move components around: "Utils" is a pretty bad name anyway. I'd
> recommend something with "convert" or similar in the name.

Naming is a total bikeshed and I already spent time renaming stuff to nsIParserUtils. Also, I expect the interface to gain non-conversion methods in due course.
Comment 29 Ben Bucksch (:BenB) 2012-02-09 06:09:59 PST
> Naming is a total bikeshed

Yes, true, but naming is also the reason why nobody used nsIFormatConverter (bug 720623): nobody found it, nobody knew it. For me, "utils" means internal, minor stuff that is not very interesting. I would *never* look there, if I was searching for a HTML->TXT converter.

But whatever, we'll see.
Comment 30 Henri Sivonen (:hsivonen) 2012-02-20 06:36:02 PST
Created attachment 598845 [details] [diff] [review]
Part 0: Fold old nsParserUtils into nsContentUtils

While at it, I removed a duplicated instance of the PI pseudo attribute parsing code and named the methods in a way that's more appropriate in terms of xml-stylesheet terminology.
Comment 31 :Ms2ger 2012-02-20 08:51:34 PST
I assume the change in nsHtml5TreeOpExecutor.h wasn't meant for this patch?
Comment 32 Henri Sivonen (:hsivonen) 2012-02-20 23:51:08 PST
Created attachment 599058 [details] [diff] [review]
Part 0: Fold old nsParserUtils into nsContentUtils, v2
[Checked in: Comment 47]

> I assume the change in nsHtml5TreeOpExecutor.h wasn't meant for this patch?

Correct. Thanks.
Comment 33 Henri Sivonen (:hsivonen) 2012-02-21 01:38:34 PST
Created attachment 599067 [details] [diff] [review]
Part 4: Scripted test case
[Checked in: Comment 47]
Comment 34 Henri Sivonen (:hsivonen) 2012-02-21 01:44:05 PST
Created attachment 599068 [details] [diff] [review]
Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
[Checked in: Comment 54]
Comment 35 Henri Sivonen (:hsivonen) 2012-02-21 01:45:18 PST
Created attachment 599070 [details] [diff] [review]
Part 1: Introduce the new API, v2
[Checked in: Comment 47]
Comment 36 Olli Pettay [:smaug] 2012-02-21 04:25:12 PST
Comment on attachment 599058 [details] [diff] [review]
Part 0: Fold old nsParserUtils into nsContentUtils, v2
[Checked in: Comment 47]


>+nsContentUtils::IsJavaScriptLanguage(const nsString& aName, PRUint32 *aFlags)
>+{
>+  JSVersion version = JSVERSION_UNKNOWN;
>+
>+  if (aName.LowerCaseEqualsLiteral("javascript") ||
>+      aName.LowerCaseEqualsLiteral("livescript") ||
>+      aName.LowerCaseEqualsLiteral("mocha")) {
>+    version = JSVERSION_DEFAULT;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.0")) {
>+    version = JSVERSION_1_0;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.1")) {
>+    version = JSVERSION_1_1;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.2")) {
>+    version = JSVERSION_1_2;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.3")) {
>+    version = JSVERSION_1_3;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.4")) {
>+    version = JSVERSION_1_4;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.5")) {
>+    version = JSVERSION_1_5;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.6")) {
>+    version = JSVERSION_1_6;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.7")) {
>+    version = JSVERSION_1_7;
>+  }
>+  else if (aName.LowerCaseEqualsLiteral("javascript1.8")) {
>+    version = JSVERSION_1_8;
>+  }
>+  if (version == JSVERSION_UNKNOWN)
>+    return false;

I know, you're just moving code, but could you fix the coding style
if (expr) {
  stmt;
}
and
} else if (expr) {
Comment 37 Henri Sivonen (:hsivonen) 2012-02-21 06:28:17 PST
Bah. Now that TBPL is back, it shows failures where there were none locally. :-(
Comment 38 Ben Bucksch (:BenB) 2012-02-21 10:37:33 PST
Comment on attachment 599068 [details] [diff] [review]
Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
[Checked in: Comment 54]

Ok, these are far-reaching changes. PlainTextSerializer has
* so many use cases (format=flowed, rich plaintext and copy&paste)
* delicate edge cases (various block elements, vertical and horizontal white space etc.) - basically similar to the HTML Editor rules
* was created a decade before we had a useable testsuite and
* is highly prone to regressions

This has to be reviewed by somebody who has been involved in its creation (akk, Daniel or me, but not even I feel comfortable reviewing this, far too many changes).
Comment 39 Ben Bucksch (:BenB) 2012-02-21 10:43:08 PST
Thunderbird is highly dependent on this class.

One practical way to test is to run a large number of both HTML documents and HTML emails through the old version and the new version and see whether the result matches.

Only exception is bug 314213: We need an empty line after <blockquote type=site>, but it's currently not there. You'd be highly welcome to fix this :). Please mind nested quotes, too.
Comment 40 Henri Sivonen (:hsivonen) 2012-02-21 22:49:08 PST
Created attachment 599488 [details] [diff] [review]
Part 0.5: Let the build system know that parser/html/ can have .idl files
[Checked in: Comment 47]

This patch fixes the tryserver issues mentioned in comment 37.
Comment 41 Henri Sivonen (:hsivonen) 2012-02-21 22:59:01 PST
(In reply to Ben Bucksch (:BenB) from comment #38)
> Comment on attachment 599068 [details] [diff] [review]
> Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
> 
> Ok, these are far-reaching changes.

I think you are exaggerating. Part 3 does a pretty straight-forward substitution of old HTML tag enum values with corresponding nsGkAtoms. The most complex-looking changes arise, because in the new version, text nodes don't have a tag id enum value.

nsContentUtils::IsHTMLVoid is trivial and easy to verify to be correct.

nsContentUtils::IsHTMLBlock is derived by reading what the old code did. It should be possible to verify its equivalence to the old code by looking at the old and new code carefully. I welcome you to look at the old and new code carefully to check that I read the old code correctly.

(In reply to Ben Bucksch (:BenB) from comment #39)
> Thunderbird is highly dependent on this class.

Does Thunderbird accordingly have unit tests for this?

> One practical way to test is to run a large number of both HTML documents
> and HTML emails through the old version and the new version and see whether
> the result matches.

Since you seem to have an idea of what cases are interesting, it would be helpful if you could supply test cases (as HTML files--not in email wrappers).
Comment 42 Olli Pettay [:smaug] 2012-02-26 16:19:27 PST
Comment on attachment 599488 [details] [diff] [review]
Part 0.5: Let the build system know that parser/html/ can have .idl files
[Checked in: Comment 47]

rs=me
Comment 43 Olli Pettay [:smaug] 2012-02-26 16:26:00 PST
Comment on attachment 599070 [details] [diff] [review]
Part 1: Introduce the new API, v2
[Checked in: Comment 47]


>+  /**
>+   * Converts HTML source to plain text by parsing the source and using the
>+   * plain text serializer on the resulting tree.
>+   *
>+   * @param aSourceBuffer the string to parse as an HTML document
>+   * @param aResultBuffer the string where the plain text result appears;
>+   *                      may be the same string as aSourceBuffer
>+   * @param aFlags Flags from nsIDocumentEncoder.
>+   * @param aWrapCol Number of columns after which to line wrap
0 prevents wrapping, right? If so, could you please document it.
Comment 44 Olli Pettay [:smaug] 2012-02-26 16:58:14 PST
Comment on attachment 599068 [details] [diff] [review]
Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
[Checked in: Comment 54]

>+nsContentUtils::IsHTMLBlock(nsIAtom* aLocalName)
>+{
>+  return
>+    (aLocalName == nsGkAtoms::address) ||
>+    (aLocalName == nsGkAtoms::article) ||
>+    (aLocalName == nsGkAtoms::aside) ||
>+    (aLocalName == nsGkAtoms::blockquote) ||
>+    (aLocalName == nsGkAtoms::center) ||
>+    (aLocalName == nsGkAtoms::dir) ||
>+    (aLocalName == nsGkAtoms::div) ||
>+    (aLocalName == nsGkAtoms::dl) || // XXX why not dt and dd?
>+    (aLocalName == nsGkAtoms::fieldset) ||
>+    (aLocalName == nsGkAtoms::figure) || // XXX shouldn't figcaption be on this list
>+    (aLocalName == nsGkAtoms::footer) ||
>+    (aLocalName == nsGkAtoms::form) ||
>+    (aLocalName == nsGkAtoms::h1) ||
>+    (aLocalName == nsGkAtoms::h2) ||
>+    (aLocalName == nsGkAtoms::h3) ||
>+    (aLocalName == nsGkAtoms::h4) ||
>+    (aLocalName == nsGkAtoms::h5) ||
>+    (aLocalName == nsGkAtoms::h6) ||
>+    (aLocalName == nsGkAtoms::header) ||
>+    (aLocalName == nsGkAtoms::hgroup) ||
>+    (aLocalName == nsGkAtoms::hr) ||
>+    (aLocalName == nsGkAtoms::li) ||
>+    (aLocalName == nsGkAtoms::listing) ||
>+    (aLocalName == nsGkAtoms::menu) ||
>+    (aLocalName == nsGkAtoms::multicol) || // XXX get rid of this one?
>+    (aLocalName == nsGkAtoms::nav) ||
>+    (aLocalName == nsGkAtoms::ol) ||
>+    (aLocalName == nsGkAtoms::p) ||
>+    (aLocalName == nsGkAtoms::pre) ||
>+    (aLocalName == nsGkAtoms::section) ||
>+    (aLocalName == nsGkAtoms::table) ||
>+    (aLocalName == nsGkAtoms::ul) ||
>+    (aLocalName == nsGkAtoms::xmp);
>+}
Please file followups for the XXX

>-        if(NS_SUCCEEDED(GetAttributeValue(aNode, nsGkAtoms::start, startAttr))){
>+        if(NS_SUCCEEDED(GetAttributeValue(nsGkAtoms::start, startAttr))){
space after 'if' and before '{'

>-  else if (IsBlockLevel(aTag)
>-           && type != eHTMLTag_script
>-           && type != eHTMLTag_doctypeDecl
>-           && type != eHTMLTag_markupDecl) {
>+  else if (nsContentUtils::IsHTMLBlock(aTag)
>+           && aTag != nsGkAtoms::script) {
Hmm, where do we handle doctype serialization?


>+    nsAutoString aTagAttr;
Badly named variable. aFoo is for parameters.


> nsPlainTextSerializer::IsInPre()
> {
>   PRInt32 i = mTagStackIndex;
>   while(i > 0) {
>-    if(mTagStack[i-1] == eHTMLTag_pre)
>+    if(mTagStack[i-1] == nsGkAtoms::pre)
>       return true;
>-    if(IsBlockLevel(mTagStack[i-1])) {
>+    if(nsContentUtils::IsHTMLBlock(mTagStack[i-1])) {
i - 1, and space after 'if'

Tricky to review.
Comment 45 Olli Pettay [:smaug] 2012-02-26 17:03:41 PST
Comment on attachment 599067 [details] [diff] [review]
Part 4: Scripted test case
[Checked in: Comment 47]

Could you put the test to content/base/test/chrome
Comment 46 Henri Sivonen (:hsivonen) 2012-02-27 04:00:52 PST
(In reply to Olli Pettay [:smaug] from comment #36)
> Comment on attachment 599058 [details] [diff] [review]
> Part 0: Fold old nsParserUtils into nsContentUtils, v2
...
> I know, you're just moving code, but could you fix the coding style
> if (expr) {
>   stmt;
> }
> and
> } else if (expr) {

Fixed.

(In reply to Olli Pettay [:smaug] from comment #43)
> Comment on attachment 599070 [details] [diff] [review]
> Part 1: Introduce the new API, v2
...
> 0 prevents wrapping, right? If so, could you please document it.

Documented.

(In reply to Olli Pettay [:smaug] from comment #45)
> Comment on attachment 599067 [details] [diff] [review]
> Part 4: Scripted test case
> 
> Could you put the test to content/base/test/chrome

Moved there.

Thanks.

Landed parts other than 2 and 3:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ccebf6862a9
https://hg.mozilla.org/integration/mozilla-inbound/rev/d40ba365b995
https://hg.mozilla.org/integration/mozilla-inbound/rev/429263e58090
https://hg.mozilla.org/integration/mozilla-inbound/rev/a423ca3c3ce1
https://hg.mozilla.org/integration/mozilla-inbound/rev/8ea9dc2f8570

comm-central needs a patch so that it starts using the API introduced in part 1 before parts 2 and 3 can land without burning c-c.
Comment 48 Serge Gautherie (:sgautherie) 2012-03-01 02:01:10 PST
(In reply to Henri Sivonen (:hsivonen) from comment #46)
> comm-central needs a patch so that it starts using the API introduced in
> part 1 before parts 2 and 3 can land without burning c-c.

Ftr, you filed that as bug 650787.
Comment 49 David :Bienvenu 2012-03-01 07:55:26 PST
Comment on attachment 599068 [details] [diff] [review]
Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
[Checked in: Comment 54]

has this bit-rotted, or does it rely on some other patch to be applied first? It doesn't apply cleanly for me.
Comment 50 alta88 2012-03-01 16:07:13 PST
it seems the current landing has broken feeds use of nsIScriptableUnescapeHTML in Tb trunk.  Bug 732070.
Comment 51 Henri Sivonen (:hsivonen) 2012-03-02 06:27:13 PST
(In reply to David :Bienvenu from comment #49)
> Comment on attachment 599068 [details] [diff] [review]
> Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
> 
> has this bit-rotted, 

No, AFAICT.

> or does it rely on some other patch to be applied
> first? It doesn't apply cleanly for me.

It requires part 2 (attachment 591720 [details] [diff] [review]).
Comment 52 Henri Sivonen (:hsivonen) 2012-03-07 02:30:09 PST
Created attachment 603648 [details] [diff] [review]
Part 2: Make nsPlainText serializer not be a content sink, unrotted
[Checked in: Comment 54]
Comment 53 Henri Sivonen (:hsivonen) 2012-03-08 06:48:09 PST
(In reply to Olli Pettay [:smaug] from comment #44)
> Comment on attachment 599068 [details] [diff] [review]
> Part 3: Remove dependency on nsIParserNode and nsIParserService, v4
> Please file followups for the XXX

Bug 734059, bug 734060 and bug 734063.

> >-        if(NS_SUCCEEDED(GetAttributeValue(aNode, nsGkAtoms::start, startAttr))){
> >+        if(NS_SUCCEEDED(GetAttributeValue(nsGkAtoms::start, startAttr))){
> space after 'if' and before '{'

Fixed throughout the file.

> >-  else if (IsBlockLevel(aTag)
> >-           && type != eHTMLTag_script
> >-           && type != eHTMLTag_doctypeDecl
> >-           && type != eHTMLTag_markupDecl) {
> >+  else if (nsContentUtils::IsHTMLBlock(aTag)
> >+           && aTag != nsGkAtoms::script) {
> Hmm, where do we handle doctype serialization?

It's a no-op in the .h file.

> >+    nsAutoString aTagAttr;
> Badly named variable. aFoo is for parameters.

Fixed.

> > nsPlainTextSerializer::IsInPre()
> > {
> >   PRInt32 i = mTagStackIndex;
> >   while(i > 0) {
> >-    if(mTagStack[i-1] == eHTMLTag_pre)
> >+    if(mTagStack[i-1] == nsGkAtoms::pre)
> >       return true;
> >-    if(IsBlockLevel(mTagStack[i-1])) {
> >+    if(nsContentUtils::IsHTMLBlock(mTagStack[i-1])) {
> i - 1, and space after 'if'

Fixed.

Thanks.

https://hg.mozilla.org/integration/mozilla-inbound/rev/0617488cb3b9
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c8f60098608
https://hg.mozilla.org/integration/mozilla-inbound/rev/a674b76d8d35
Comment 55 Jorge Villalobos [:jorgev] 2012-05-03 16:57:43 PDT
I'm writing the Firefox 13 compat docs, so please correct me if I'm wrong: nsIScriptableUnescapeHTML can still be used, but it should be considered deprecated in favor of nsIParserUtils. Is this right?
Comment 56 Henri Sivonen (:hsivonen) 2012-05-04 00:27:20 PDT
(In reply to Jorge Villalobos [:jorgev] from comment #55)
> I'm writing the Firefox 13 compat docs, so please correct me if I'm wrong:
> nsIScriptableUnescapeHTML can still be used, but it should be considered
> deprecated in favor of nsIParserUtils. Is this right?

That's correct. Keeping nsIScriptableUnescapeHTML around is cheap, so there aren't any plans to remove is at present.  However, I would encourage extension developers to use the new APIs in order to pass flags that are appropriate for their use cases.  The flags implied by the old API may not be the most appropriate ones considering the use cases.

Note You need to log in before you can comment on or make changes to this bug.