Closed Bug 650787 Opened 13 years ago Closed 12 years ago

nsPlainTextSerializer: Avoid dependencies on the old HTML parser when converting HTML to plain text

Categories

(MailNews Core :: MIME, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 13.0

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 4 obsolete files)

It appears that comm-central has multiple places where the old HTML parser is used with nsPlainTextSerializer as a sink to convert HTML to plain text.

The old parser and the sink interfaces are going to go away, so these conversions in mailnews core should use the new parser and a plain text serializer that takes a DOM (sub)tree instead of being a parser sink. Code consolidation with bug 650784 would make sense.
Blocks: 650775
I strongly recommend against a reimplementation. nsPlainTextSerializer is very involved and delicate code. It took a lot of fine-tuning to get it right in all cases: copy-paste, normal plaintext, and format=flowed. Each of these by itself has totally non-obvious edge cases that needed to be considered. Rewriting this is bound to cause a lot of regressions.

I suggest to simply move this implementation to the new APIs, whatever they are.
Summary: Avoid dependencies on the old HTML parser when converting HTML to plain text → nsPlainTextSerializer: Avoid dependencies on the old HTML parser when converting HTML to plain text
(In reply to comment #1)
> I suggest to simply move this implementation to the new APIs, whatever they
> are.

nsPlainTextSerializer seems to inherit from nsIContentSerializer already. Does it already actually work as an nsIContentSerializer impl? If it does, I suggest simply dropping support for the nsIHTMLContentSink and nsIHTMLToTextSink interfaces and using nsPlainTextSerializer only as an nsIContentSerializer implementation.
> Does it already actually work as an nsIContentSerializer impl?

I don't think anybody knows, and I think you're best suited to find out.
I am having some success reimplementing nsPlainTextSerializer by implementing it as a specialization of nsHTMLContentSerializer. Although it may seem counter intuitive nsHTMLContentSerializer has most of the code already written. I'm just modifying it to fit ie strip out the appending of the open and close tags. also the newer implementation is much more elegant and simplifies the coding significantly.

It would help me if someone could write some test cases involving lists and tables or just more through tests in general.
Also I am wondering whether I should Ignore canvas tags in the newer code?
Blocks: 563890
William can you post your patch in progress.
Blocks: 563890
(In reply to william bradshaw from comment #5)
> Blocks: 563890

Why?
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> (In reply to william bradshaw from comment #5)
> > Blocks: 563890
> 
> Why?.

Missed the differance of the :I" thought nsPlainTextSerializer was an nsHTMLContentSink it's not it's a nsIHTMLContentSink sorry about that my bad. I'll post my in progress patch latter tonight when I get back to my dorm.
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> (In reply to comment #1)
> > I suggest to simply move this implementation to the new APIs, whatever they
> > are.
> 
> nsPlainTextSerializer seems to inherit from nsIContentSerializer already.
> Does it already actually work as an nsIContentSerializer impl? If it does, I
> suggest simply dropping support for the nsIHTMLContentSink and
> nsIHTMLToTextSink interfaces and using nsPlainTextSerializer only as an
> nsIContentSerializer implementation.

<http://www.tjhsst.edu/~jcranmer/c-ccov/content/base/src/nsPlainTextSerializer.cpp.gcov.html> seems to suggest that it is not tested as such (there may be stuff in mochitests/reftests/etc., but I doubt it). Given how poorly tested the code is, I am not sure I would trust just flipping the impl use and seeing if it works.
No longer blocks: 563890
william bradshaw, thanks for looking into this.

> also the newer implementation is much more elegant
> and simplifies the coding significantly.

When changing the implementation *logic* (not just APIs), please be super careful. This class was fine-tuned for many use–cases, including Thunderbird plaintext mail send, format=flowed, clipboard paste from browser, again in two variants: rich and simple text, and no linebreaks. Also test pasting of URLs.

This is very prone to regressions in edge cases, so I'm very vary of re-implementations or any changes to the logic whatsoever in fact.
I think I've found that my success in implementing nsPlainTextSerializer is limited to marginal success and am finding the deeper I go the more things don't work as I thought they did.  I'm now siding with the not reimplementing crowd. Although it seems the idea seems to work on a high level small details of the way functions are implemented would cause me modify virtually every function of nsHTMLSerializer to attain full behavioral recreation. Although at this time (for time saving purposes) just moving to new API's is looking rather appealing to me. I would however advocate Bug 650784 or changing implementation to more closely follow the model set up by the nsXMLSerializer class tree to provide a more consistent code base.
Attached patch PlainTextSerializer as HTML (obsolete) — Splinter Review
Attachment #562637 - Attachment is patch: true
Depends on: 102699
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
This need parts 1 and 1.5 from bug 650784 on m-c.
Attachment #562637 - Attachment is obsolete: true
Attachment #591759 - Flags: review?(mbanner)
Comment on attachment 591759 [details] [diff] [review]
Use nsIScriptableHTMLParser for HTML to plain text conversion

Hmm. This should probably use do_GetService instead of do_CreateInstance.
Attachment #591759 - Flags: review?(mbanner)
Attachment #591759 - Attachment is obsolete: true
Attachment #601225 - Flags: review?(dbienvenu)
Blocks: 650776
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

Review of attachment 601225 [details] [diff] [review]:
-----------------------------------------------------------------

::: suite/installer/package-manifest.in
@@ +201,5 @@
>  @BINPATH@/components/filepicker.xpt
>  #endif
>  @BINPATH@/components/find.xpt
>  @BINPATH@/components/gfx.xpt
> +@BINPATH@/components/html5.xpt

Ah, I just did it in bug 731668 :-|
Depends on: 368758
Blocks: 732070
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

>diff --git a/mail/installer/package-manifest.in b/mail/installer/package-manifest.in
>+@BINPATH@/components/html5.xpt

Due to bug 732070, I've landed this part of the patch with r=me. I'll let David look at the rest later.
Comment on attachment 601225 [details] [diff] [review]
Use nsIParseUtils for plain text conversion, with .xpt additions to packaging

The changeset was: http://hg.mozilla.org/comm-central/rev/8eca2a2319d8

(Sorry, I forgot to correct the username after extracting that)
Depends on: 732691
Attachment #601225 - Attachment is obsolete: true
Attachment #601225 - Flags: review?(dbienvenu)
Attachment #603650 - Flags: review?(dbienvenu)
Attachment #603650 - Flags: review?(dbienvenu) → review+
Thanks for the r+. Landed:
http://hg.mozilla.org/comm-central/rev/766214fa44ab
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 13.0
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: