Last Comment Bug 650787 - nsPlainTextSerializer: Avoid dependencies on the old HTML parser when converting HTML to plain text
: nsPlainTextSerializer: Avoid dependencies on the old HTML parser when convert...
Status: RESOLVED FIXED
:
Product: MailNews Core
Classification: Components
Component: MIME (show other bugs)
: unspecified
: All All
: -- normal (vote)
: Thunderbird 13.0
Assigned To: Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
:
Mentors:
Depends on: 102699 368758 650784 732691
Blocks: 650775 650776 732070
  Show dependency treegraph
 
Reported: 2011-04-18 07:17 PDT by Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
Modified: 2012-04-20 03:54 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
PlainTextSerializer as HTML (153.39 KB, patch)
2011-09-26 19:28 PDT, william bradshaw
no flags Details | Diff | Splinter Review
Use nsIScriptableHTMLParser for HTML to plain text conversion (11.96 KB, patch)
2012-01-26 05:15 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
no flags Details | Diff | Splinter Review
Use nsIParseUtils for plain text conversion (11.11 KB, patch)
2012-02-27 06:17 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
no flags Details | Diff | Splinter Review
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging (12.57 KB, patch)
2012-02-28 02:51 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
no flags Details | Diff | Splinter Review
Use nsIParseUtils for plain text conversion, without already-landed packaging changes (11.11 KB, patch)
2012-03-07 02:39 PST, Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26)
mozilla: review+
Details | Diff | Splinter Review

Description Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2011-04-18 07:17:43 PDT
It appears that comm-central has multiple places where the old HTML parser is used with nsPlainTextSerializer as a sink to convert HTML to plain text.

The old parser and the sink interfaces are going to go away, so these conversions in mailnews core should use the new parser and a plain text serializer that takes a DOM (sub)tree instead of being a parser sink. Code consolidation with bug 650784 would make sense.
Comment 1 Ben Bucksch (:BenB) 2011-04-23 01:12:13 PDT
I strongly recommend against a reimplementation. nsPlainTextSerializer is very involved and delicate code. It took a lot of fine-tuning to get it right in all cases: copy-paste, normal plaintext, and format=flowed. Each of these by itself has totally non-obvious edge cases that needed to be considered. Rewriting this is bound to cause a lot of regressions.

I suggest to simply move this implementation to the new APIs, whatever they are.
Comment 2 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2011-04-26 00:03:21 PDT
(In reply to comment #1)
> I suggest to simply move this implementation to the new APIs, whatever they
> are.

nsPlainTextSerializer seems to inherit from nsIContentSerializer already. Does it already actually work as an nsIContentSerializer impl? If it does, I suggest simply dropping support for the nsIHTMLContentSink and nsIHTMLToTextSink interfaces and using nsPlainTextSerializer only as an nsIContentSerializer implementation.
Comment 3 Ben Bucksch (:BenB) 2011-04-26 03:10:48 PDT
> Does it already actually work as an nsIContentSerializer impl?

I don't think anybody knows, and I think you're best suited to find out.
Comment 4 william bradshaw 2011-09-21 05:03:43 PDT
I am having some success reimplementing nsPlainTextSerializer by implementing it as a specialization of nsHTMLContentSerializer. Although it may seem counter intuitive nsHTMLContentSerializer has most of the code already written. I'm just modifying it to fit ie strip out the appending of the open and close tags. also the newer implementation is much more elegant and simplifies the coding significantly.

It would help me if someone could write some test cases involving lists and tables or just more through tests in general.
Also I am wondering whether I should Ignore canvas tags in the newer code?
Comment 5 william bradshaw 2011-09-21 05:07:05 PDT
Blocks: 563890
Comment 6 Ludovic Hirlimann [:Usul] 2011-09-21 05:09:23 PDT
William can you post your patch in progress.
Comment 7 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2011-09-21 05:55:43 PDT
(In reply to william bradshaw from comment #5)
> Blocks: 563890

Why?
Comment 8 william bradshaw 2011-09-21 07:48:21 PDT
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> (In reply to william bradshaw from comment #5)
> > Blocks: 563890
> 
> Why?.

Missed the differance of the :I" thought nsPlainTextSerializer was an nsHTMLContentSink it's not it's a nsIHTMLContentSink sorry about that my bad. I'll post my in progress patch latter tonight when I get back to my dorm.
Comment 9 Joshua Cranmer [:jcranmer] 2011-09-21 09:53:03 PDT
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> (In reply to comment #1)
> > I suggest to simply move this implementation to the new APIs, whatever they
> > are.
> 
> nsPlainTextSerializer seems to inherit from nsIContentSerializer already.
> Does it already actually work as an nsIContentSerializer impl? If it does, I
> suggest simply dropping support for the nsIHTMLContentSink and
> nsIHTMLToTextSink interfaces and using nsPlainTextSerializer only as an
> nsIContentSerializer implementation.

<http://www.tjhsst.edu/~jcranmer/c-ccov/content/base/src/nsPlainTextSerializer.cpp.gcov.html> seems to suggest that it is not tested as such (there may be stuff in mochitests/reftests/etc., but I doubt it). Given how poorly tested the code is, I am not sure I would trust just flipping the impl use and seeing if it works.
Comment 10 Ben Bucksch (:BenB) 2011-09-22 23:58:46 PDT
william bradshaw, thanks for looking into this.

> also the newer implementation is much more elegant
> and simplifies the coding significantly.

When changing the implementation *logic* (not just APIs), please be super careful. This class was fine-tuned for many use–cases, including Thunderbird plaintext mail send, format=flowed, clipboard paste from browser, again in two variants: rich and simple text, and no linebreaks. Also test pasting of URLs.

This is very prone to regressions in edge cases, so I'm very vary of re-implementations or any changes to the logic whatsoever in fact.
Comment 11 william bradshaw 2011-09-26 19:11:40 PDT
I think I've found that my success in implementing nsPlainTextSerializer is limited to marginal success and am finding the deeper I go the more things don't work as I thought they did.  I'm now siding with the not reimplementing crowd. Although it seems the idea seems to work on a high level small details of the way functions are implemented would cause me modify virtually every function of nsHTMLSerializer to attain full behavioral recreation. Although at this time (for time saving purposes) just moving to new API's is looking rather appealing to me. I would however advocate Bug 650784 or changing implementation to more closely follow the model set up by the nsXMLSerializer class tree to provide a more consistent code base.
Comment 12 william bradshaw 2011-09-26 19:28:08 PDT
Created attachment 562637 [details] [diff] [review]
PlainTextSerializer as HTML
Comment 13 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2012-01-26 05:15:57 PST
Created attachment 591759 [details] [diff] [review]
Use nsIScriptableHTMLParser for HTML to plain text conversion

This need parts 1 and 1.5 from bug 650784 on m-c.
Comment 14 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2012-01-27 07:36:06 PST
Comment on attachment 591759 [details] [diff] [review]
Use nsIScriptableHTMLParser for HTML to plain text conversion

Hmm. This should probably use do_GetService instead of do_CreateInstance.
Comment 15 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2012-02-27 06:17:33 PST
Created attachment 600887 [details] [diff] [review]
Use nsIParseUtils for plain text conversion
Comment 16 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2012-02-28 02:51:01 PST
Created attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging
Comment 17 Serge Gautherie (:sgautherie) 2012-03-01 01:57:48 PST
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

Review of attachment 601225 [details] [diff] [review]:
-----------------------------------------------------------------

::: suite/installer/package-manifest.in
@@ +201,5 @@
>  @BINPATH@/components/filepicker.xpt
>  #endif
>  @BINPATH@/components/find.xpt
>  @BINPATH@/components/gfx.xpt
> +@BINPATH@/components/html5.xpt

Ah, I just did it in bug 731668 :-|
Comment 18 Mark Banner (:standard8) 2012-03-02 01:03:06 PST
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

>diff --git a/mail/installer/package-manifest.in b/mail/installer/package-manifest.in
>+@BINPATH@/components/html5.xpt

Due to bug 732070, I've landed this part of the patch with r=me. I'll let David look at the rest later.
Comment 19 Mark Banner (:standard8) 2012-03-02 01:04:53 PST
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

The changeset was: http://hg.mozilla.org/comm-central/rev/8eca2a2319d8

(Sorry, I forgot to correct the username after extracting that)
Comment 20 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2012-03-07 02:39:54 PST
Created attachment 603650 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, without already-landed packaging changes
Comment 21 Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2016-09-26) 2012-03-08 06:27:32 PST
Thanks for the r+. Landed:
http://hg.mozilla.org/comm-central/rev/766214fa44ab

Note You need to log in before you can comment on or make changes to this bug.