Closed Bug 16800 Opened 25 years ago Closed 25 years ago

Improve HTML -> plain text

Categories

(MailNews Core :: Composition, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: BenB, Assigned: BenB)

Attachments

(5 files)

Convert HTML's <strong>, <em> and <a href="..."> to *bold*, _italic_ and
<URL:...>

This bug was split off from bug #16507; see the description and early comments there.
The best place to add these would be in nsHTMLToTXTSinkStream::OpenContainer()
and CloseContainer().  See what it currently does for list items (eHTMLTag_li):
you can call Write() with one character on opening the tag, and again (same or
different character) on closing the tag.

<b> and <i> should also work, since a lot of people use those tags instead of
<strong> and <em>.

You can assign this bug to me if you want, and I'll do it when I find time;
adding the text types shouldn't take any time (I can do that in about 15 minutes
plus testing); the URLs are a bit harder, as I tried to do that once before and
ran into a snag of some sort.
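
A minimal sketch of the open/close idea for the text types, written in Python rather than against nsHTMLToTXTSinkStream. The handle_starttag() and handle_endtag() callbacks of Python's html.parser stand in for OpenContainer() and CloseContainer(); the MARKERS table, the InlineMarkupSink class and to_plain_text() are hypothetical names, not part of any patch on this bug.

```python
from html.parser import HTMLParser

# Assumed mapping: one marker character per inline tag, written on open and close.
MARKERS = {"strong": "*", "b": "*", "em": "_", "i": "_"}


class InlineMarkupSink(HTMLParser):
    """Toy plain-text sink that writes a marker when a tag opens and closes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Plays the role of OpenContainer(): write one character on open.
        self.out.append(MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        # Plays the role of CloseContainer(): write the same character on close.
        self.out.append(MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)


def to_plain_text(html):
    sink = InlineMarkupSink()
    sink.feed(html)
    sink.close()  # flush any text still buffered by the parser
    return "".join(sink.out)


print(to_plain_text("a <b>bold</b> and <em>soft</em> word"))
```

Tags without an entry in the table contribute nothing, so unknown markup is dropped while its text is kept.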

I'd like to argue about the URL format, though.  What I was thinking about doing
for URLs was to convert <a href="foo">bar</a> into bar (foo).  If it converted
into URL:foo, then I'd no longer be able to use double-click to select it in a
standard plaintext window, which would be a major pain (I do that many times a
day) -- I'd have to target the text after the colon, then carefully drag to the
end of the word, instead of just targeting any part of the word and
double-clicking.
Akk,

there's some discussion about the form going on on n.p.m.mail-news in thread
"Assuming "plain text" or "html mail"..", starting at
news://news.mozilla.org/38054F6C.8624442@bucksch.org.

Brenden pointed to RFC1738 ftp://venera.isi.edu/in-notes/rfc1738.txt, Appendix
(Page 21), where <URL:...> is recommended.
Reading rfc1738, it does not seem to be absolutely required that no space
follow the "<URL:" in <URL:foo://bar>. While the string "<URL:" is never
followed by anything other than the first letter of the URL scheme anywhere
in the RFC text, the only part of the RFC that describes the use of
<URL:foo://bar>, the Appendix, does not explicitly rule out the use of
<URL: foo://bar> in the same context.

Here is the relevant passage:
"In addition, there are many occasions when URLs are included in other
kinds of text; examples include electronic mail, USENET news
messages, or printed on paper. In such cases, it is convenient to
have a separate syntactic wrapper that delimits the URL and separates
it from the rest of the text, and in particular from punctuation
marks that might be mistaken for part of the URL. For this purpose, is
recommended [sic] that angle brackets ("<" and ">"), along with the
prefix "URL:", be used to delimit the boundaries of the URL. This
wrapper does not form part of the URL and should not be used in
contexts in which delimiters are already specified."

From this context it seems clear that this syntax is wanted for
human parsing, not machine parsing. The use of spaces inside the angle
brackets is neither encouraged nor disallowed.

On this basis, would it be possible to have our cake and
follow the RFC too by using <URL: foo://bar> instead of <URL:foo://bar>?

This would address the concerns akkana@netscape.com raised about being
able to click on URLs in plain-text (especially useful for e-mail).
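
To make the three link formats under discussion concrete, here is a rough sketch that renders the same anchor each way. It assumes the link text is kept and the target is appended after it, which this bug does not spell out; LINK_STYLES, LinkSink and convert_links() are hypothetical names.

```python
from html.parser import HTMLParser

# Candidate renderings discussed in this bug (the style names are made up here).
LINK_STYLES = {
    "paren": "{text} ({href})",
    "rfc1738": "{text} <URL:{href}>",
    "rfc1738_space": "{text} <URL: {href}>",
}


class LinkSink(HTMLParser):
    """Toy sink that rewrites <a href="..."> elements with one template."""

    def __init__(self, style):
        super().__init__(convert_charrefs=True)
        self.template = LINK_STYLES[style]
        self.out = []
        self.href = None     # href of the <a> currently open, if any
        self.link_text = []  # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_data(self, data):
        target = self.link_text if self.href is not None else self.out
        target.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = "".join(self.link_text)
            self.out.append(self.template.format(text=text, href=self.href))
            self.href = None


def convert_links(html, style):
    sink = LinkSink(style)
    sink.feed(html)
    sink.close()
    return "".join(sink.out)


sample = 'see <a href="http://example.org/">the page</a> now'
for name in LINK_STYLES:
    print(convert_links(sample, name))
```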
sidr, please post this to n.p.m.mailnews to the thread I've mentioned.
Trying every Windows mailer immediately available to me on NT
(Messenger, Eudora, Outlook Express), all seem to make working
links out of <URL:foo://bar>, so the objection akkana@netscape.com
raises seems to be theoretical. My response proposing <URL: foo://bar>
(with the space after "URL:") should be treated likewise.
Not theoretical, but something I do many times a day.  You can set up xterms so
that colons are considered part of words, but you can't set them up so that
http: is part of a word but URL: is not.
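
The double-click problem can be illustrated with a small simulation. This is only a sketch: select_word() and the WORD_CHARS set are hypothetical and merely imitate a terminal configured to treat ":", "/" and "." as parts of a word; it is not how xterm itself is implemented.

```python
import string

# Assumed word characters: letters, digits, and the punctuation common in URLs.
WORD_CHARS = set(string.ascii_letters + string.digits + ":/._-~")


def select_word(text, index, word_chars=WORD_CHARS):
    """Return the run of word characters around text[index] (a double-click)."""
    if text[index] not in word_chars:
        return text[index]
    start = index
    while start > 0 and text[start - 1] in word_chars:
        start -= 1
    end = index
    while end < len(text) - 1 and text[end + 1] in word_chars:
        end += 1
    return text[start:end + 1]


tight = "see <URL:http://example.org/x> ok"
spaced = "see <URL: http://example.org/x> ok"
print(select_word(tight, tight.index("example")))
print(select_word(spaced, spaced.index("example")))
```

Without the space the selection drags the "URL:" prefix along, because the prefix and the scheme are joined by the same colon; with the space, only the URL is selected.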

What's the benefit of the URL: part?  I've seen it here and there, but I've
never seen an application that requires it or does anything useful with it.
Most applications I've seen already recognize http:// as a URL and don't need
any further prefix.
Ok, it's not theoretical. I did ask in n.p.m.mailnews if anyone had
experience with X MUAs, but I'm sure both messages passed each other
in the mails :-) . In that case, would the "adapted" rfc1738
"compliant" <URL: foo://bar> work, or would the ">" at the end cause
problems too? I suppose <URL: foo://bar > would fix that, but that
is starting to look ugly.

The point, as explained by TimBL et al in the appendix of rfc1738,
is to make it easier for a human to tell the boundaries of a URL
in plain-text writing. An example that might be seen in a mail message:
"Have you tried <URL:http://www.foobar.nosuchtld/something.cgi>?" -
without the angle-bracket delimiters, the "?" could look like part
of the URL. Other punctuation is just as bad, since it's easy to miss
while copying and pasting. The other area where it helps is in URLs
that get broken over lines.
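
A short sketch of how a reader-side tool could use the wrapper to find URL boundaries. It assumes that whitespace inside the angle brackets (a space after "URL:", or a line break in a long URL) is simply dropped on extraction; WRAPPED_URL and extract_wrapped_urls() are hypothetical names.

```python
import re

# "<URL:" followed by anything up to the closing ">".
WRAPPED_URL = re.compile(r"<URL:([^<>]*)>")


def extract_wrapped_urls(text):
    """Return the URLs found in <URL:...> wrappers, with inner whitespace removed."""
    return ["".join(match.group(1).split()) for match in WRAPPED_URL.finditer(text)]


message = (
    "Have you tried <URL:http://www.foobar.nosuchtld/something.cgi>? "
    "Or <URL: http://a.example/very/\n long/path>."
)
print(extract_wrapped_urls(message))
```

The trailing "?" and "." stay outside the extracted URLs because the closing bracket marks the boundary.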
The > doesn't cause a problem; since it isn't a legal character in a URL, it's
easy to filter that out of the "word" definition.

Personally, I think URL: is ugly and don't see why that part is necessary (I do
understand wanting the angle brackets, for the reasons you gave) but if it was
separated from the actual URL by a space, I wouldn't have a functional objection
to it, just the aesthetic one.
Akk,
I don't want to be ignorant, but if we don't follow the RFC recommendation and
create our own scheme (one more suggestion: <foo://bar>), I want to be sure we
have a real reason.
I still don't understand your problem. Do you read your mail using a text-based
mailer? How often (applications, users) does the problem appear?
*** Bug 16958 has been marked as a duplicate of this bug. ***
Status: NEW → ASSIGNED
Depends on: 17641
*** Bug 12969 has been marked as a duplicate of this bug. ***
This HTML->plain text conversion takes place not only when I send an HTML message
to a plain-text recipient, but also when I copy some text from a Web page and
paste it into a plain-text message, or into Notepad, or wherever, right?
Yes and no: yes, it goes through the same output converters, but no, it goes
through with different flags (unformatted for copy/paste rather than formatted
for mail) so this sort of conversion won't be done on copy/paste.  Most people
just want the raw text and nothing else when copy/pasting plain text.
Depends on: 17723
Matty,
could you please explain what you meant by your comment on bug #16507?

"Regarding translating em and strong to plain text, I'm not really sure, but it
would be really nice to use a stylesheet. That sounds quite complicated though."

I don't know how we could use a stylesheet to do a conversion. As I know them,
they just apply attributes to SGML or XML tags. There are no such tags in plain
text messages.
Now I'm completely confused :-). We have HTML or XIF as input here, to which we
can apply stylesheets. But I still don't know how to use that for a conversion
to TXT.
I was referring to looking at the stylesheet, seeing that em maps to bold, and
hence putting the plain text version in stars.  But this assumes that stars
means bold, which may or may not be the case.
The changes fix indentation, ul and ol, nested lists, structured phrases and,
of course, links.
Unfortunately, the diffs won't apply after the checkin of format=flowed.
Nevertheless, I attached them, because the changes might be useful when fixing
the remaining bugs in the new files.
The syntax of nsString usage depends on bug #17641, but can be changed easily
if 17641 doesn't make it into the tree.
Depends on: 17883
Summary: HTML -> plain text: strong, em, a → Improve HTML -> plain text
I'll try to get lists working, too.
The patches are for version 3.19.
Apart from support for a, img, em, i, strong, b, and sup (I didn't know of a
plaintext equivalent for sub), they implement indentation for lists and nested
lists, and improve the list bullets/numbers.
The latter still has a minor problem (too many spaces); I'll send a new version
of the patches when 17883 is fixed.
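
For illustration only, a sketch of indented nested-list output with a configurable indent width. The input shape (a list kind plus items, where a nested list is a (kind, items) tuple) and the render_list() name are assumptions made for this example and do not reflect the attached patches.

```python
def render_list(kind, items, indent=4, depth=1):
    """Render a nested list as indented plain-text lines.

    kind is "ul" or "ol"; each item is either a string or a (kind, items)
    tuple describing a nested list.
    """
    lines = []
    number = 0
    for item in items:
        if isinstance(item, tuple):
            sub_kind, sub_items = item
            # A nested list is indented one level deeper than its parent.
            lines.extend(render_list(sub_kind, sub_items, indent, depth + 1))
            continue
        number += 1
        marker = "*" if kind == "ul" else f"{number}."
        lines.append(" " * (indent * depth) + f"{marker} {item}")
    return lines


outline = ["one", ("ul", ["a", "b"]), "two"]
print("\n".join(render_list("ol", outline, indent=2)))
```

Numbering continues across a nested list because only plain items advance the counter.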
The patch looks good.  We obviously need to make the indentation configurable; I
made it 2 because that's the mozilla standard for code (not that that has
anything to do with plaintext output); 4 would be my personal preference, 8
seems too much, but we should make it a preference so everyone can decide.  I
can add that later.

On writing out the title in SetTitle: I don't think that's the right thing to do
in plaintext, because it isn't something that's shown to the user when looking
at the html page, and the goal of formatted plaintext output is to make the
plaintext look as much as possible like the html looked to the user.  So my
inclination is to remove that part unless you can make a case for why we should
be outputting the title.

I still hate the URL: without a space since it means I can't select words by
double-clicking, but I'll try to live with it for a while and see how much of an
annoyance it turns out to be.

Why did you remove the stripping of quotes from
value.StripChars("\"").Equals("cite", PR_TRUE)?  I had cases where it didn't
work due to that (you can't always count on the quotes being there or not), so I
think this needs to be there unless you know of some change that's happened
recently that will guarantee quotes not being in the attribute string (which
would be nice).
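
The intent of that check, sketched in Python rather than with nsString: strip any double quotes from the attribute value, then compare case-insensitively. attr_equals() is a hypothetical helper, and reading the PR_TRUE argument as "ignore case" is an assumption.

```python
def attr_equals(value, expected):
    """Compare an attribute value to expected, ignoring quotes and case."""
    return value.replace('"', "").casefold() == expected.casefold()


# The quotes may or may not be present, depending on where the value came from.
print(attr_equals('"cite"', "cite"), attr_equals("CITE", "cite"))
```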

Otherwise, it looks good, and we should take this whenever the tree opens for
checkins.
Akk,

The indentation is a leftover from my testing. My preference would be 4, too. I
agree that this should be configurable. I'll try to build it into my final patch.

SetTitle:
In many of my HTML files, titles carry *very* important information that isn't
repeated in the body.

I do not agree that plaintext should look like browser output. Not only is
HTML not about the look, plaintext also has its own way of representing meaning.
What I think we should do is take the information the HTML file carries and
produce a plaintext document that looks the way it would if the document had
been created directly in plaintext.

Of course, it doesn't make sense to print it in this form in an email. I forgot
to add a test for emptiness, so that an empty title isn't written. Can we figure
out a system that works for both cases (are there more?)? Maybe defaulting the
title to zero length?

URL:
You didn't answer my questions (see above) yet, so I had to decide. It is extremely
simple to change (just search for occurrences of "URL:", note the plural).

StripChars:
I don't remember how I came to the conclusion that this isn't needed. I think
I made a test with a node with quotes and |value| didn't contain them. If you
have problems, leave it in.
Status update: I'll see, what I can do for tables now.

(My mail bounces at the moment :-((. Until fixing, send mail to
ben.bucksch@uumail.de.)
I think the presence or absence of quotes is affected by whether we're parsing
from html or from XIF.

The answer to the question about URL: without the space is in my comment about
xterms -- it applies to any app (including a mail program) which runs in an
xterm.

I have a patch which integrates Ben's work, a quoting fix from Daniel, and some
patches of my own into one, which I'm attaching to this bug.
Oh, forgot to address the SetTitle issue: I was under the impression that most
web pages (certainly mine are like this) basically repeat the title information
in an H1 or similar at the top of the page.  So if we output the title, the
output will show two identical or nearly identical lines, one of which wasn't
visible at all to the person viewing the html page.  Title is meta-information
which, in my opinion, doesn't belong in the output.  Any lurkers on this bug want to
add their opinions?
For everybody not wanting to analyse the patch: Akk's changes include a
space after "URL:", ifdef'ing SetTitle, reincluding StripChars, 2 fixes for
"VerticalSpace" and adding support for <hr>.

The latter should be changed. Using the wrap width should be easy and should be
done (she expressed the intention to do so), and just a "Write" will do the wrong
thing, from my understanding of Daniel's new functions. You should make sure the
hr is on its own line.
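
A rough sketch of the <hr> behaviour asked for here: the rule spans the wrap width and always sits on a line of its own. append_hr(), its default width of 72 and the "-" fill character are assumptions for illustration, not what the patch does.

```python
def append_hr(text, wrap_width=72, fill="-"):
    """Append a horizontal rule as wide as wrap_width, on its own line."""
    if text and not text.endswith("\n"):
        text += "\n"  # finish the current line before writing the rule
    return text + fill * wrap_width + "\n"


print(append_hr("Some paragraph text", wrap_width=20), end="")
```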
Akk, it would be nice if we could move the discussion about opinions to the
newsgroups; they provide better structure for that (e.g. threading). This page is
already long enough. I suggest mail-news.
Opened bug #18012 to avoid a superbug.
No longer depends on: 17641
Shaver seems to have been right. RFC1855
<URL:ftp://venera.isi.edu/in-notes/rfc1855.txt>: "Use underscores for
underlining."
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
No longer depends on: 17723
QA Contact: lchiang → asj
Severity: normal → enhancement
Product: MailNews → Core
Product: Core → MailNews Core