nsPlainTextSerializer: Avoid dependencies on the old HTML parser when converting HTML to plain text

RESOLVED FIXED in Thunderbird 13.0

Status

MailNews Core
MIME
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: hsivonen, Assigned: hsivonen)

Tracking

(Blocks: 1 bug)

unspecified
Thunderbird 13.0
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 4 obsolete attachments)

(Assignee)

Description

6 years ago
It appears that comm-central has multiple places where the old HTML parser is used with nsPlainTextSerializer as a sink to convert HTML to plain text.

The old parser and the sink interfaces are going to go away, so these conversions in mailnews core should use the new parser and a plain text serializer that takes a DOM (sub)tree instead of being a parser sink. Code consolidation with bug 650784 would make sense.
(Assignee)

Updated

6 years ago
Blocks: 650775

Comment 1

6 years ago
I strongly recommend against a reimplementation. nsPlainTextSerializer is very involved and delicate code. It took a lot of fine-tuning to get it right in all cases: copy-paste, normal plaintext, and format=flowed. Each of these by itself has totally non-obvious edge cases that needed to be considered. Rewriting this is bound to cause a lot of regressions.

I suggest to simply move this implementation to the new APIs, whatever they are.
Summary: Avoid dependencies on the old HTML parser when converting HTML to plain text → nsPlainTextSerializer: Avoid dependencies on the old HTML parser when converting HTML to plain text
(Assignee)

Comment 2

6 years ago
(In reply to comment #1)
> I suggest to simply move this implementation to the new APIs, whatever they
> are.

nsPlainTextSerializer seems to inherit from nsIContentSerializer already. Does it already actually work as an nsIContentSerializer impl? If it does, I suggest simply dropping support for the nsIHTMLContentSink and nsIHTMLToTextSink interfaces and using nsPlainTextSerializer only as an nsIContentSerializer implementation.

Comment 3

6 years ago
> Does it already actually work as an nsIContentSerializer impl?

I don't think anybody knows, and I think you're best suited to find out.

Comment 4

6 years ago
I am having some success reimplementing nsPlainTextSerializer by implementing it as a specialization of nsHTMLContentSerializer. Although it may seem counter intuitive nsHTMLContentSerializer has most of the code already written. I'm just modifying it to fit ie strip out the appending of the open and close tags. also the newer implementation is much more elegant and simplifies the coding significantly.

It would help me if someone could write some test cases involving lists and tables or just more through tests in general.
Also I am wondering whether I should Ignore canvas tags in the newer code?

Comment 5

6 years ago
Blocks: 563890
William can you post your patch in progress.
Blocks: 563890
(Assignee)

Comment 7

6 years ago
(In reply to william bradshaw from comment #5)
> Blocks: 563890

Why?

Comment 8

6 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> (In reply to william bradshaw from comment #5)
> > Blocks: 563890
> 
> Why?.

Missed the differance of the :I" thought nsPlainTextSerializer was an nsHTMLContentSink it's not it's a nsIHTMLContentSink sorry about that my bad. I'll post my in progress patch latter tonight when I get back to my dorm.
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> (In reply to comment #1)
> > I suggest to simply move this implementation to the new APIs, whatever they
> > are.
> 
> nsPlainTextSerializer seems to inherit from nsIContentSerializer already.
> Does it already actually work as an nsIContentSerializer impl? If it does, I
> suggest simply dropping support for the nsIHTMLContentSink and
> nsIHTMLToTextSink interfaces and using nsPlainTextSerializer only as an
> nsIContentSerializer implementation.

<http://www.tjhsst.edu/~jcranmer/c-ccov/content/base/src/nsPlainTextSerializer.cpp.gcov.html> seems to suggest that it is not tested as such (there may be stuff in mochitests/reftests/etc., but I doubt it). Given how poorly tested the code is, I am not sure I would trust just flipping the impl use and seeing if it works.
(Assignee)

Updated

6 years ago
No longer blocks: 563890

Comment 10

6 years ago
william bradshaw, thanks for looking into this.

> also the newer implementation is much more elegant
> and simplifies the coding significantly.

When changing the implementation *logic* (not just APIs), please be super careful. This class was fine-tuned for many use–cases, including Thunderbird plaintext mail send, format=flowed, clipboard paste from browser, again in two variants: rich and simple text, and no linebreaks. Also test pasting of URLs.

This is very prone to regressions in edge cases, so I'm very vary of re-implementations or any changes to the logic whatsoever in fact.

Comment 11

6 years ago
I think I've found that my success in implementing nsPlainTextSerializer is limited to marginal success and am finding the deeper I go the more things don't work as I thought they did.  I'm now siding with the not reimplementing crowd. Although it seems the idea seems to work on a high level small details of the way functions are implemented would cause me modify virtually every function of nsHTMLSerializer to attain full behavioral recreation. Although at this time (for time saving purposes) just moving to new API's is looking rather appealing to me. I would however advocate Bug 650784 or changing implementation to more closely follow the model set up by the nsXMLSerializer class tree to provide a more consistent code base.

Comment 12

6 years ago
Created attachment 562637 [details] [diff] [review]
PlainTextSerializer as HTML
(Assignee)

Updated

6 years ago
Attachment #562637 - Attachment is patch: true
(Assignee)

Updated

5 years ago
Depends on: 102699
(Assignee)

Updated

5 years ago
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
(Assignee)

Comment 13

5 years ago
Created attachment 591759 [details] [diff] [review]
Use nsIScriptableHTMLParser for HTML to plain text conversion

This need parts 1 and 1.5 from bug 650784 on m-c.
Attachment #562637 - Attachment is obsolete: true
Attachment #591759 - Flags: review?(mbanner)
(Assignee)

Comment 14

5 years ago
Comment on attachment 591759 [details] [diff] [review]
Use nsIScriptableHTMLParser for HTML to plain text conversion

Hmm. This should probably use do_GetService instead of do_CreateInstance.
Attachment #591759 - Flags: review?(mbanner)
(Assignee)

Comment 15

5 years ago
Created attachment 600887 [details] [diff] [review]
Use nsIParseUtils for plain text conversion
Attachment #591759 - Attachment is obsolete: true
(Assignee)

Comment 16

5 years ago
Created attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging
Attachment #600887 - Attachment is obsolete: true
(Assignee)

Updated

5 years ago
Attachment #601225 - Flags: review?(dbienvenu)
(Assignee)

Updated

5 years ago
Blocks: 650776
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

Review of attachment 601225 [details] [diff] [review]:
-----------------------------------------------------------------

::: suite/installer/package-manifest.in
@@ +201,5 @@
>  @BINPATH@/components/filepicker.xpt
>  #endif
>  @BINPATH@/components/find.xpt
>  @BINPATH@/components/gfx.xpt
> +@BINPATH@/components/html5.xpt

Ah, I just did it in bug 731668 :-|

Updated

5 years ago
Depends on: 368758
Blocks: 732070
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

>diff --git a/mail/installer/package-manifest.in b/mail/installer/package-manifest.in
>+@BINPATH@/components/html5.xpt

Due to bug 732070, I've landed this part of the patch with r=me. I'll let David look at the rest later.
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

The changeset was: http://hg.mozilla.org/comm-central/rev/8eca2a2319d8

(Sorry, I forgot to correct the username after extracting that)

Updated

5 years ago
Depends on: 732691
(Assignee)

Comment 20

5 years ago
Created attachment 603650 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, without already-landed packaging changes
Attachment #601225 - Attachment is obsolete: true
Attachment #601225 - Flags: review?(dbienvenu)
Attachment #603650 - Flags: review?(dbienvenu)

Updated

5 years ago
Attachment #603650 - Flags: review?(dbienvenu) → review+
(Assignee)

Comment 21

5 years ago
Thanks for the r+. Landed:
http://hg.mozilla.org/comm-central/rev/766214fa44ab
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 13.0
You need to log in before you can comment on or make changes to this bug.