Closed Bug 16507 Opened 25 years ago Closed 25 years ago

Improve Plain text -> HTML

Categories

(MailNews Core :: MIME, enhancement, P3)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: BenB, Assigned: BenB)

Attachments

(7 files)

*bold* <-> <strong>
_italic_ <-> <em>
URL <URL:ftp://venera.isi.edu/in-notes/rfc1738.txt> <-> <a
href="ftp://venera.isi.edu/in-notes/rfc1738.txt">URL</a>
Assignee: ducarroz → mozilla
Status: NEW → ASSIGNED
Target Milestone: M15
I don't know if we will directly support those formatting commands during composition, but we should be able to
display them correctly. About the URLs, if I am not wrong, we already support them when displaying a message.
Yes, URLs are clickable (although the <URL: and > still appear).

Why not support during composition? (Assuming composition is done via HTML
editor and then converted to plain text.) We could stop data loss.

At the moment, I'm just "evaluating" (browsing through the code).
Ben, are you talking about these improvements in the context of reading a plain
text message which has *foobar* and then generating <b>foobar</b> for display
purposes? If so, rhp@netscape.com is the right owner for the bug.

Or are you talking about doing something when composing a message? If so, I'm
not clear what you're suggesting.

In any case, I don't think we need to do any more with URLs than we do. We
recognize them just fine without the extra syntax.
Both. The first case, together with your mention of URLs, covers the plain
text -> HTML part.

The conversion after composition from HTML -> plain text, for example, loses URLs and
other formatting (URLs being the worst).

Let's <em>say</em> I <strong>composed</strong> this <a
href="http://www.mozilla.org">message</a>

and decide to send it as multipart/alternative. The plain text part looks like this:

Let's say I composed this message

but it should look like this:

Let's _say_ I *composed* this message <URL:http://www.mozilla.org>

I assigned it to myself because I wanted to see what I can do. If you want it to
be implemented soon, assign it to rhp.
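
To make the proposed HTML -> plain text mapping concrete, here is a minimal sketch in Python. It is not the Mozilla output code: the class name `PlainTextSink` and the function `html_to_plain` are invented for illustration, and only <em>, <strong> and <a href> are handled, using the markers proposed above.

```python
from html.parser import HTMLParser


class PlainTextSink(HTMLParser):
    """Hypothetical sink: keeps emphasis and link targets when dropping tags."""

    # Assumed mapping, taken from the proposal in this comment.
    MARKERS = {"em": "_", "strong": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []   # pieces of plain text output, in order
        self.hrefs = []   # stack of href values for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])
        elif tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                # Append the target after the link text so it is not lost.
                self.parts.append(f" <URL:{href}>")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plain(html_text: str) -> str:
    sink = PlainTextSink()
    sink.feed(html_text)
    sink.close()
    return "".join(sink.parts)
```

Fed the composed message above, this sketch produces the "should look like this" line, including the <URL:http://www.mozilla.org> suffix.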
Perhaps also foo^h^h^hbar in plain text could be displayed as
<strike>foo</strike>bar?
rhp,

could you please review my patch and check it in?

Please review very very carefully. I'm sure, there're still all kinds of bugs.
At least, it's egcs proven and seems to work on Linux. (I have no licence for VC
:-(.)

Comments marked "Deleteme" are temporary notes for you.
This all sounds like good stuff. I completely agree with Ben's 10/15 comments
that sending <a> links through plain text should include the URL.

I'm cc'ing rhp since he can probably suggest places where reading plain text and
generating HTML could be improved (as with the smiley face for :-) in mozilla)
and akkana since she can probably suggest places to improve outputting the
editor content model as plain text.

Ben, if I were you, I might split this bug up into several smaller ones, but
it's your call.
I always thought it would be _underline_, *bold*, /italic/.  Isn't that the
traditional way?

(I'd love to see |code| work, too.)
Yes, the ability to do this type of formatting/text recognition is in the code
today and was enhanced from 4.x to 5.0. In 5.0, plain text URLs that are not
prefaced with the protocol (e.g. www.netscape.com) will be recognized as URLs.
Also, email addresses that are just typed as rhp@netscape.com will also be
link-a-fied.

Something I should point out is that even plain text mail display is being done
with an HTML capable rendering engine, so you can tweak the text however you
want with HTML tags and the display will do the right thing.

While I was doing that, I played with having emoticons display as the image
they are supposed to represent. So :-) got replaced with a little smiley face.
Of course, "purists" told me I was ruining the Internet so that is why I put it
on a preference setting.

The code to do this is somewhat isolated, but when I get some spare time (ha,
ha, ha, ha...ok, I'm done) I wanted to make this thing truly extensible. I
would love to be able to have an interface that would let you do whatever you
wanted to do to the output before display.

The code of interest here is in the file:

   http://lxr.mozilla.org/mozilla/source/mailnews/mime/src/nsMimeURLUtils.cpp

Look at the function:

        nsresult        nsMimeURLUtils::ScanForURLs()

and you can see what is going on.

Enjoy!

- rhp
Adding myself to cc list.  I'm puzzled why this is in libmime -- shouldn't it
live in the normal output methods, in nsHTMLToTXTSinkStream?  Does libmime do
its own output conversion?

It would be pretty easy to add these conversions to nsHTMLToTXTSinkStream.
Akkana, I think the reading side is in libmime, but the writing side uses your
output stuff.
The rendering side of this lives in libmime. We also do some basic link-a-fying
in the compose back end for when you type http://www.netscape.com into an HTML
compose window, but don't actually create the link.

- rhp
Does anybody know an RFC that can confirm or refute Mike's comment?

|code| would be very easy to implement. But I need more info about usage; I've
never seen that. Is this used for variables or code fragments? Are they aligned in
blocks? I built in many safety checks, so none of the following would be
converted:
|code;| |<code>|
|code
code|
But I could change this if someone can tell me more details or, even better,
point me to a spec.
Component: Composition → MIME
Summary: Support *bold*, _italic_ and URLs in plain text → Plain text -> HTML: *bold*, _italic_ and URLs
Obviously, we need to split off the HTML -> plain text conversion. Created bug
#16800.

Summary changed. Not sure what the Component "Networking-Mail" means, so I chose MIME.
It's just a cosmetic change, but it would be nice, if <URL:...> would be
converted to just <a href="...">...</a>, not &lt;URL:<a href="...">...</a>&gt;.
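
A rough sketch of that cosmetic change, in Python rather than the libmime code. The regular expression and the function name `linkify_url_tags` are assumptions for illustration; the idea is only that the whole <URL:...> wrapper becomes the link, so no escaped &lt;URL: and &gt; remain around it.

```python
import html
import re

# Assumed syntax: "<URL:" followed by a URL without spaces or angle brackets.
URL_TAG = re.compile(r"<URL:([^<>\s]+)>")


def linkify_url_tags(text: str) -> str:
    """Escape plain text for HTML and turn <URL:...> into a bare link."""
    out = []
    pos = 0
    for match in URL_TAG.finditer(text):
        # Text between matches is ordinary plain text and must be escaped.
        out.append(html.escape(text[pos:match.start()]))
        url = html.escape(match.group(1))
        out.append(f'<a href="{url}">{url}</a>')
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Any other "<" or ">" in the message is still escaped as usual; only the recognized wrapper is consumed.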
Eventually, this will have to be tested.  Is there a spec or something that we
can follow and write testcases for so that we get good coverage on this feature?
lchiang,

"Spec" is in the source as comment :-):

*Bold* -> <strong>
DELIMITER: not alphanumeric and not "*"
We're searching for the following pattern:
DELIMITER - "*" - ALPHA -
[ some text (maybe more "*"-pairs) - ALPHA ] "*" - DELIMITER.
<strong> is only inserted if the existence of a pair could be verified.
Same for _italic_ -> <em>

This is generally used to *stress* some word or *some phrase*; both cases should
be covered. Many others, such as * bold *, are excluded on purpose, so that the pattern is not triggered by
"5 * 3 * 4 = 60" (safety first).

Providing test cases would make QA somewhat useless. (My own test cases work, of
course.)
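
For anyone writing test cases, the pattern above can be sketched in a few lines of Python. This is an illustration only, not the code from the patch: the names `structured_phrase_pattern` and `mark_up` are made up, ALPHA is assumed to mean an ASCII letter, the start and end of the text are assumed to count as DELIMITER, and a phrase is assumed not to span lines.

```python
import re


def structured_phrase_pattern(marker: str) -> re.Pattern:
    """Build a regex for DELIMITER marker ALPHA [text ALPHA] marker DELIMITER."""
    m = re.escape(marker)
    # A DELIMITER is anything that is not alphanumeric and not the marker.
    not_delimiter = rf"[A-Za-z0-9{m}]"
    return re.compile(
        rf"(?<!{not_delimiter})"             # DELIMITER (or start of text) before
        rf"{m}"                              # opening marker
        rf"([A-Za-z](?:[^\n]*?[A-Za-z])?)"   # ALPHA [ some text - ALPHA ]
        rf"{m}"                              # closing marker
        rf"(?!{not_delimiter})"              # DELIMITER (or end of text) after
    )


def mark_up(text: str, marker: str, tag: str, keep_markers: bool = False) -> str:
    """Wrap every verified marker pair in <tag>...</tag>."""
    def replace(match: re.Match) -> str:
        inner = match.group(1)
        shown = f"{marker}{inner}{marker}" if keep_markers else inner
        return f"<{tag}>{shown}</{tag}>"

    return structured_phrase_pattern(marker).sub(replace, text)
```

With this sketch, mark_up("to *stress* some word", "*", "strong") wraps "stress" in <strong>, while "5 * 3 * 4 = 60" and "* bold *" come back unchanged because no "*" is directly followed by a letter.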
Not sure what HTML you'd generate for |code|. Since we're reading a text/plain
message, it would already be rendered in a monospace font. Not much point
in wrapping it in a <pre>. shaver, did you have something in mind?

Lisa, I think this is testable in pretty much the same way as colorizing quoted
material in plain text messages:
1. Send yourself a plain text message with *foo* /bar/ _baz_
2. Read the message, and note that foo is bold, bar is italicized, and baz is
underlined.

Further capabilities (like what we do with |code| or <URL:xxx>) are TBD, I think.
lchiang,

I just saw that *you* wanted to write testcases. Sorry, I misunderstood you.

BTW: "[ something ]" means "something" is optional.
Is the Spec clear enough?
Phil,

<code> comes to mind :-):
<URL:http://www.w3.org/TR/REC-html40/struct/text.html#h-9.2.1>.

Plain text is not necessarily rendered as monospaced (at least in 4.x). I'm
reading in a proportional font (screwing up tables and ASCII-art :-( ).

But even if the display is monospaced, <code> should be rendered differently to
distinguish it from (prose) text.
(Thanks - I will review all this next week)
See <URL:news://news.mozilla.org/380D04D9.4D86E941@bucksch.org> ("ASCII-art
detection" under "Assuming "plain text" or "html mail".." at n.p.m.mail-news
from 19 Oct 99 23:55:05 GMT) for ASCII-art proposal.
I agree with Shaver, /italic/ and _underline_ are frequently done this way.



I know of an amiga newsreader that does this, should I find out what all it does?
Planb,
see shaver's comment on bug 16800. But an RFC or at least an Internet-Draft would
be really helpful; I couldn't find any mention.
Usual warnings apply, this time especially regarding the passing of nsString
between functions (leaks etc.). Again: I'm unable to take any responsibility for
the code :-(.

/italic/ works now.
_underline_ is transformed to <em>, since <u> is deprecated, I would have to use
stylesheets. Any ideas?
|code| is commented out, because it is invisible in monospaced viewers and I
remove the "|". It also works the same as *bold*, need more info (see my notes
above).
RichP, would you code review this please?
Sorry for the delay in this review. Looks good to me!

- rhp
Target Milestone: M15 → M11
Cool. Marked M11.

Need suggestions for ascii-art detection, see
news://news.mozilla.org/380D04D9.4D86E941@bucksch.org or
http://www.deja.com/msgid.xp?MID=<380D04D9.4D86E941@bucksch.org>
and its reply. BTW: I just noticed, dejanews uses "<" and ">" in URLs. Nice.

|StructPhraseHit(nsCAutoString text, PRBool col0, ...|
should better be
|StructPhraseHit(const nsCAutoString text, PRBool col0, ...|
Assignee: mozilla → rhp
Status: ASSIGNED → NEW
Assigning to rhp, so he can check it in.
Assignee: rhp → mozilla
Hi Ben,
I am really suspicious of the Right() call. I don't understand why you don't get
garbage on return. If you do get a valid string returned, then the
nsCAutoString is returning an allocated string, which means we are leaking. The
problem is there are tons of string classes so I am unsure of the exact
behavior.

I would probably return a newly allocated string and free it on the calling
side. This may be what nsCAutoString is doing (without the free), but I'm not
sure. I know that nsString.ToNewCString() will do this, and then you have to
free the memory.

- rhp
Rich,

I did some research and everything is like I hoped it would be :-). I love C++.

- return copies
Objects are returned by invoking the copy constructor (1997 C++ Public Review
Document, Section [class.copy],
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.copy>).
This is the reason, why I "don't get garbage".
[stmt.return]
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/stmt.html#stmt.return>
- Destruction on out of scope
If an (automatic) object falls out of scope, the destructor is called.
[class.dtor], paragraph 10, case 2
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.dtor>
- Example
An example of my usage of objects is at [class.temporary], Paragraph 2
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.dtor>

- |ns*AutoString|s free the memory at destruction.
"The point of nsAutoStrings is [...] to auto-destroy the string when it goes out
of scope." (<URL:http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsStr.h#132>)
If my understanding of |ns*String| is correct, all |ns*String|s free the memory
at destruction, if they own it (see
<URL:http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsString.cpp#137> and
<URL:http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsStr.h#239>).

Reassigning to me, since checkin is done.
Status: NEW → ASSIGNED
Sorry, the link for the example is wrong. The correct one is:
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.temporary>
huftis,
I didn't forget your question^H^H^H^H^H^H^Hproposal, but it will be hard to
implement, because the code walks char by char through the msg. The other plain
text tags enclose the phrase like HTML tags do, so I could just substitute. I
think, /I/ won't implement that^H^H^H^Hyour proposal.
> huftis,
> I didn't forget your question^H^H^H^H^H^H^Hproposal,
> but it will be hard to implement. I think, /I/ won't
> implement that^H^H^H^Hyour proposal.

OK, but how about character substitution. Example:

=>          -->   U+21D2
--> or ->   -->   U+2192

And perhaps even:
^2     -->   U+00B2
1/2    -->   U+00BD
(C)    -->   U+00A9
huftis,
the following strings are not substituted:
 TXT  | HTML     | Reason
------+----------+----------
 ->   | &rarr;   | Char not displayed on Linux (not even a placeholder)*
 =>   | &rArr;   | ditto
 <-   | &larr;   | ditto
 <=   | &lArr;   | ditto
 (tm) | &trade;  | ditto
 1/4  | &frac14; | is triggered by 1/4 Part 1, 2/4 Part 2, ...
 3/4  | &frac34; | ditto
 1/2  | &frac12; | similar
 !=   | &ne;     | used in C/C++(-pseudo)-code
 <=   | &le;     | ditto
 ...  | ...      | ditto
------+----------+----------
*I'd like to know why.
I'm substituting "(c)", "(r)" and "+/-" (using rhp's glyph substitution code),
but I'm not even sure, if the signs for these display correctly on all platforms
(tested only on Linux).
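
As a rough illustration of the three substitutions that are enabled, here is a Python sketch. It does not reproduce rhp's glyph substitution code: the table `GLYPHS`, the function `substitute_glyphs` and its `enabled` flag are hypothetical, and matching is assumed to be exact and lowercase only.

```python
# Assumed table: only the three strings named above are substituted.
GLYPHS = {
    "(c)": "\u00a9",   # copyright sign
    "(r)": "\u00ae",   # registered sign
    "+/-": "\u00b1",   # plus-minus sign
}


def substitute_glyphs(text: str, enabled: bool = True) -> str:
    """Replace the listed ASCII sequences; leave everything else alone."""
    if not enabled:
        return text
    for plain, glyph in GLYPHS.items():
        text = text.replace(plain, glyph)
    return text
```

Strings from the "not substituted" table, such as "1/4 Part 1, 2/4 Part 2", pass through this sketch unchanged because they are simply absent from the mapping.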
You might be interested in
<URL:http://www.w3.org/TR/REC-html40/sgml/entities.html>.

rhp,
could you please review that and check it in? Tnx.

QA,
Test this (all my patches) with wild and unusual test cases. Every substitution,
where it shouldn't be, is a bug. File it against me.
Summary: Plain text -> HTML: *bold*, _italic_ and URLs → Improve Plain text -> HTML
Oh, I forgot the best :-): Exponents are <sup>'ed.

Changing Summary.
Wouldn't it be better if ^5 were substituted with &#x2075; (U+2075)? If the font
didn't contain that glyph, it could *then* be converted to <SUP>5</SUP> (see
BUG #12662 <URL:http://bugzilla.mozilla.org/show_bug.cgi?id=12662>).
I've filed a couple of bugs on some of these entities which aren't displayed in
Linux (&bull; is another one).  Bug 454 seems to be the main bug concerning
these (currently marked as TRIVIAL so I wonder if we're going to be stuck with
this bug forever); 5383 concerned &trade; but was duped to 454; 16872 is another
one on &bull; specifically (which might be a different issue since it can be done
in gfx instead of requiring a font that has those characters).
Depends on: 454
If _a_ means underline, you should use underline.  If you don't want to use <u>,
then use <span> and CSS.

It's certainly better to have HTML be document markup rather than
presentational, but these ARE distinctly presentational, and I think they should
be rendered presentationally.  The same for italic - making it em seems wrong to
me.
Matty,
what is the plain text equivalent to <em>? I usually don't want to underline
something, I want to stress something in different levels. I think, <u> is
deprecated for a good reason.
I would suggest that *emphasized text* is used more to indicate emphasis (i.e.
<em>) than to indicate bold.  I would expect *starred text to be in italic* (or
whatever is being used for <em>) and ALL_CAPS text to be in BOLD.

I agree that _underlined_text_ should map to <u> since that ascii construct is
very specific (and awkward to type, so no one would use it unless they really do
mean underline).
*sigh*

The only thing on which we had all agreed until now was that *stars* mean bold.
I usually use _this_ to emphasize something (but not as much as *bold*), having
italic in mind. So I _do_ use it and do not mean underline. And I think others
do, too.

We can't use <u>, because it's deprecated; we would have to add a stylesheet.

My personal opinion is: I want presentational layout to die. And underline is an
ugly looking leftover from the times of typesetting machines, where there were
no other methods to stress something.
What if I add a pref "mail.do_struct_phrase_presentational", defaulting to
FALSE, that maps *bold* to <b>, /italic/ to <i> and _underline_ to underlining
based on stylesheets?

(We already have "mail.do_struct_phrase" (and "mail.do_glyph_substitution",
BTW). Maybe we could consolidate them and change the names.)
Ben, you shouldn't use <B> any more than <U>.  It may not be deprecated, but
it's still presentational, and hopefully it will be deprecated in future.  I
think you can use a span and have a style attribute that will allow you to do
all with CSS.

Regarding translating em and strong to plain text, I'm not really sure, but it
would be really nice to use a stylesheet.  That sounds quite complicated
though.  Maybe just use bold.

I guess it seems they aren't used consistently, in which case your original
mappings make sense.  This way at least the user can edit their plaintext
stylesheet to reflect how they want messages displayed.
Matty,
a stylesheet for all formatting sounds like a good idea to me. Unfortunately, I
have never really worked with stylesheets. It may take some time until code follows.

HTML -> plain text is offtopic here now, see bug #16800. I don't know, how you
want to use a stylesheet for HTML -> plain text conversion, but this is too late
anyway, I think (unless this is a really good idea), because I already have
working code. I just need to make last checks.
My two cents, as maintainer of the news:alt.ascii-art FAQ
(http://cantua.canterbury.ac.nz/~mpt26/art/ascii/faq/):

(1) ASCII artists (and others who use ASCII art sigs, etc) are going to be
*extremely* annoyed if the formatting-detection algorithm guesses wrong. And it
*will* guess wrong occasionally, no matter how good it is. (What happens, for
example, if I insert a /*C comment*/ ...) ASCII artists like Mozilla (in its
current incarnation as Netscape Messenger) because, unlike MS Outlook, it leaves
ASCII art alone. And they'd rather it stayed that way.

As a compromise, I would suggest doing it the way (IIRC) XEmacs' mail reader does
it. That is, apply the formatting, but *leave the special characters there*
(perhaps dim them, but leave them there). That way things won't get too
mangled if the algorithm guesses wrong.

(2) When I use /slashes/, sometimes I mean <em>, and sometimes I mean <cite>. You
can't know which I mean, because text isn't structural markup. So I see no option
except to use CSS italics. Similarly for *asterisks*, you can't know whether I
mean <strong> or <vector-space> or whatever, so CSS bold is really the only
option.

(3) If URLs without protocols are going to be detected as http addresses, as
rhp@netscape.com suggests, surely the wrong thing is going to happen for these:
* you can download this by anonymous FTP at foo.bar.net
* to begin, telnet to library.canterbury.ac.nz and log in as "guest"
* you are cordially invited to mozilla.party three.oh

And please don't tell me you're just going to do it for domains starting with
`www.', or I'll scream.

-- mpt (http://critique.net.nz/ -- not a www. in sight)
What he said.  I regularly send stuff like this:

  ``we should just prune entries matching /(mozilla\.org|netscape\.com)$/''

  ``go into your srcdir and rm *TitledButton*, then update your tree''

and I mean neither italics nor emphasis.

DWIM is very hard to get right.
mpt and Mike,

/(mozilla\.org|netscape\.com)$/'' would not be changed, as described to lchiang.
But rm *TitledButton* and /*This comment*/ would. I'll leave the plaintext tags in
the text.

mpt,

what is bad with em tags (maybe with a type attribute), if you have a
stylesheet?

I think, disabled people wouldn't like CSS bold. (Would they like us to leave
plain text tags in? Maybe one more pref? :-( )
Agreed that the characters should not disappear.  Disability issues should go
away then, and there's always the chance to apply a stylesheet to a plaintext
message.

What I was referring to about stylesheets on plaintext messages applies both to
this, and it would also be nice to turn quoting into using <BLOCKQUOTE> (do we
do this already?)

Sorry about being offtopic, I got confused for a minute.
matty, sorry, I don't understand anything you're saying, but maybe I'm
just too tired.
I don't see any explanation of why /[abcdef0123456789]/ wouldn't get ``fixed''
to <i>regex</i> above.  Can you elaborate?
Mike,

the explanation in question is (to make sure, we're speaking about the same
thing) for *bold* -> <strong>:
DELIMITER: not alphanumeric and not "*"
We're searching for the following pattern:
DELIMITER - "*" - ALPHA -
[ some text (maybe more "*"-pairs) - ALPHA ] "*" - DELIMITER.
<strong> is only inserted if the existence of a pair could be verified.

What do you mean by "/[abcdef0123456789]/"? Should *I* evaluate that or take
it as string? If this or "/(mozilla\.org|netscape\.com)$/" appears exactly that
way (or without the quotes) in the msg, my code would leave it. But if you mean,
if "/abc[abcdef0123456789]def/" (with or without the quotes) would be changed,
the answer is yes. (But neither "/abc[abcdef0123456789]789/" nor
"/abc[abcdef0123456789]/" would be changed.)
It's not yet clear what "change" means. At the moment, "/" is substituted with
"<em>". But as you pointed out, it would do the wrong thing for "rm *diff*". The
only solution I see is, as mpt suggested, to leave the plain text tags in and dim
them.
Mike's idea of content-before in CSS to re-add the TXT tags sounds very nice, but
I'm not sure what would happen if we reply with HTML. The TXT tags *should*
still be there, even if the recipient uses a non-CSS-capable HTML viewer.
mpt,
start to scream:
<URL:http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/nsMimeURLUtils.cpp#323>
But I don't know, what so wrong with guessing www.bucksch.org would really be a
reference to http://www.bucksch.org.
Matty, I still don't understand your comment.

I think, when we /remove/ the plaintext tags, this will help disabled persons
(how will "..we slash remove slash the.." sound?). The interface (whatever this
may be) can do what it thinks is best with tags like em and strong, and they're
well-known. This is one of the reasons why I vote for <em type=txt_italic> or
similar and not any "font-style: italic".

I never heard of stylesheets for plain text. What is that?
A stylesheet for plaintext messages, as in, by the time you do Plain Text ->
HTML, it will be styleable.

Your comment about disability is true, but for all users, the potential loss of
characters is too great a chance no matter what scheme you adopt.
Target Milestone: M11 → M12
any fix in hand for this? moving to m12.
move back if its ready to resolve in the next day or so.
I'll try to get rid of the hardcoded quote formatting, too, and use the
stylesheet.
*** mozilla@bucksh.org said,
> `what is bad with em tags (maybe with a type attribute), if you have a
> stylesheet?'.

What is bad with them is that:

- you'd convert "I saw /Gone with the wind/" to "I saw <em>Gone with the
  wind</em>", when it should be "I saw <cite>Gone with the wind</cite>"

- you'd convert "some random value of /x/" to "some random value of <em>x</em>",
  when it should be "some random value of <var>x</var>".

See? /slashing/ is presentational content. You can't tell which semantic thingy I
mean by it, so the only honourable thing to do is to go <span style="font-style:
italic"></span> (or whatever) instead. The same with *asterisking* as either
<strong> or <vectorspace> (or whatever the MathML tag for a vector space is). You
have to do <span style="font-weight: bold"></span>.

I'm an ardent defender of the Internet rights of disabled people, but you have to
give up the semantic content in this case, simply because you don't know which
semantic content was intended. This is plain text we're dealing with, remember,
so it's not as if we're making things any worse. (Are you going to try to turn >s
into <blockquote class="cite"></blockquote>, for example? That would be similarly
confusing for disabled people ...)

*** mozilla@bucksh.org wrote:
> mpt, start to scream:
> <URL:http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/nsMimeURLUtils.cpp#323>
>
> But I don't know, what so wrong with guessing www.bucksch.org would really be
> a reference to http://www.bucksch.org.

Aaaaaaaaaaaaargh! (Further screaming available on request.) Because it will lead
people to assume that if something doesn't start with `www.', it's not a Web
address. It's not going to highlight cnn.com, or slashdot.org, or
home.netscape.com, for example. That sucks lots.

*** I've been drawing some ASCII art (a mockup of a new Mozilla prefs dialog,
actually), and keeping it in my Drafts mail folder. I was not amused to access
the message using the latest nightly build, and find the following.

* {anything}@{anything} is assumed to be an e-mail address, even if {anything}
  has non-e-mail-characters such as `)' in it. What's up with that? What's wrong
  with "mailto:"?

* The smiley algorithm interfered with my picture (excerpt attached). Make the
  smiley faces go away. NOW.

Matthew `a smiley killed my father' Thomas
Matthew Thomas,
thanks for your notes.

First: I'm not responsible for the URL and smiley things :-). Nevertheless, a
discussion on IRC brought up the same problem, and I fixed it. I also re-added
the plaintext tags to the generated msg (the stars in *bold* are included in
the msg content now). I didn't use content-before/after, in part because it
might get lost in an HTML reply viewed in other MUAs. The code lies on my
machine, because the tree is closed.

At the moment, we (with my version) search for the following patterns: ":-)",
":-(", ";-)", ";-P", " :)" and " :(". (":(" occurs in C++ code.) They are still very
wide, because they have to catch e.g. "... bla :-).". Remember, you can always
disable it; we even have separate prefs for glyph substitution and structured
phrases.

Just look at your own post for the answer to why we try to find email addresses w/o
"mailto:" (BTW: my domain is "bucksch.org", not "bucksh.org"). Can you point me
to the RFC and place where the valid characters of email addresses are defined?
"(" is at least valid in general URLs.

>Because it will lead people to assume that if something doesn't start with
>`www.', it's not a Web addresss.

Sorry, but this reasoning is broken, for everything. B => A need not be true just because
A => B.
There's no problem if it doesn't highlight cnn.com; that's not a valid URL. All we
do is guess, that www.cnn.com is one.
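
To show what that guess amounts to, here is a small Python sketch. It is not the ScanForURLs() code: the regular expression and the name `linkify` are assumptions, and HTML escaping of the surrounding text is left out to keep it short. Text with an explicit scheme is linked as-is, text starting with "www." is assumed to mean http://, and a bare "cnn.com" is left alone.

```python
import re

# Assumed rule: an explicit scheme, or a leading "www." taken to mean http.
URL_RE = re.compile(r'\b(?:(?:https?|ftp)://|www\.)[^\s<>"]+')


def linkify(text: str) -> str:
    def replace(match: re.Match) -> str:
        # Keep sentence punctuation that follows the URL outside the link.
        shown = match.group(0).rstrip(".,;:!?)")
        trailing = match.group(0)[len(shown):]
        href = shown if "://" in shown else "http://" + shown
        return f'<a href="{href}">{shown}</a>{trailing}'

    return URL_RE.sub(replace, text)
```

So "see www.cnn.com and cnn.com" gets one link, for www.cnn.com only, which is exactly the asymmetry being argued about here.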

Structured phrases:
> - you'd convert "some random value of /x/" to "some random value of
> <em>x</em>", when it should be "some random value of <var>x</var>".

This is a misuse of the convention, because |code| is used for marking code
fragments. Nevertheless, I see nothing wrong in emphasizing "x".

> - you'd convert "I saw /Gone with the wind/" to "I saw <em>Gone with the
> wind</em>", when it should be "I saw <cite>Gone with the wind</cite>"

ditto (you should have used "", but strong is not bad).

> This is plain text we're dealing with, remember, so it's not as if we're
> making things any worse.

I have to give that back. Speaking of structured phrases: I just *add* markup,
the content remains now. And I hope, reading humans will be able to correct the
1% of wrong markup we may cause, although I try to avoid wrong markup if
possible.

> Are you going to try to turn >s into <blockquote class="cite"></blockquote>

I don't understand that.
Corrections:
> All we do is guess, that www.cnn.com is one.
All we do is guess, that www.cnn.com can be transformed into one and that it was
the intention of the author to do so.

> Nevertheless, I see nothing wrong in emphasizing "x".
Nevertheless, I see nothing wrong in emphasizing "/x/".
I just noted another problem:
"<em class=txt_star>/italic/</em>" is doubled markup, and conversion back to
plaintext (done as usual in Mozilla Mailnews) will result in "//italic//". (We
will stop there, because I don't convert "//italic//".) Of course, we can avoid
that in our own conversion, but not in those of other mailers (mail in
plaintext, reply via HTML, reply via plaintext).
Maybe "<span class=txt_star>/italic/</span>" is better. But this will take the
ability from other mailers to pretty-up the text (while it doesn't force to do
so).
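
The doubling is easy to reproduce with a toy converter. The Python sketch below is not any mailer's real HTML -> plain text code: `to_plain_naive` is a made-up function that assumes <em> is written out as "/" and every other tag is simply dropped.

```python
import re


def to_plain_naive(html_text: str) -> str:
    """Toy converter: <em> and </em> become "/", other tags are removed."""
    text = re.sub(r"</?em\b[^>]*>", "/", html_text)
    return re.sub(r"<[^>]+>", "", text)


# The kept markers plus the converter's own markers give doubled slashes.
doubled = to_plain_naive('<em class="txt_star">/italic/</em>')
# A neutral <span> carries no meaning for the converter, so nothing is added.
single = to_plain_naive('<span class="txt_star">/italic/</span>')
```

Here `doubled` is "//italic//" and `single` is "/italic/", which is the trade-off described above: the span survives a round trip, but gives other mailers nothing to work with.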
1. So the style-triggering characters are being left behind when the styles are
applied. Good. One less thing for GNUS users to gloat about.

2. In the same vein, why not just colorize smileys instead of replacing them with
a graphic? For example in ":-)", make the ":" blue (eyes), the "-" brown (nose),
and the ")" dark red (mouth)? That would get the effect across, without
corrupting accidental smileys, in the same way as making *this* bold gets the
effect across without corrupting *accidental* strings.

2. Here's a test case for you: Senator John Smith (R) said today that he was not
amused at Mozilla thinking his name was a registered trademark ...

3.
> But this will take the ability from other mailers to pretty-up the text (while
> it doesn't force to do so).

Let's get this straight ... are we applying all this formatting to *outgoing*
messages, or just to the display (and not the replying-to or forwarding) of
*incoming* messages? I sincerely hope it's just the latter ... people won't be
happy if, Eudora-Mail-like, you're misrepresenting the contents of forwarded
messages.

4. What I meant by converting >s to <blockquote>s is converting, for example,
> > foo!
> bar!

to <blockquote class="cite"><blockquote class="cite">foo!</blockquote>bar!</blockquote>.
But it's a bad idea, so don't do it. :-)

5. Section 6.1 of http://www.faqs.org/rfcs/rfc822.html says that the local part
of an e-mail address must be
word *("." word)
and I'd be surprised if `word' included brackets (it's not defined in the RFC).
But that might be outside the scope of this bug ...?
Matthew Thomas,
I just noticed, that you brought that discussion to alt.ascii. I'm not sure, if
I like the discussion to be that open.

2. (The first) That's rhp's decision, but I like the graphic smileys. If they
start to annoy me, I'll disable glyph substitution.

2. (The second) Tomorrow, somebody will say ":-)" is a valid word in some
language :-). Dunno what to do with that. Anyone else?

(Two "2."s: That's the reason, why we have HTML mail :-).)

3. When I started implementing this, I assumed, I only change display. Later,
Akk told me, that we generate the quote in a HTML reply from the displayed msg.
This would include the smiley reference, which would break. Akk?

4. Can you explain, why that's a bad idea? Don't tell me, it's used in
ASCII-art.

5. "word" is defined in Section 3.3. ("(" may occur in the local part of an
email address.)
0. Because (a) Mozilla is open source; (b) along with Forte (Free/non-free)
Agent, Mozilla is a popular choice for ASCII artists because it doesn't munge
ASCII art, *yet*; and (c) posting there was the best way I could think of to
solicit the opinions of the ASCII art community. I might be clever, but I can't
necessarily think of every impact this smoke-and-mirrors stuff will have on ASCII
art. The newsgroup as a whole has a better shot at being able to. Open source,
y'see.

3. Having the styles inserted only for display, not for replying/forwarding,
would solve the //italic// problem you described earlier, wouldn't it?

4. Because various clients use different symbols for quoting -- some use ">",
some "> ", others use ": " or "| ", some let the user select the symbol. You're
going to have a difficult job working out whether something's quoted or not, and
the net result will probably be making it look *more* of a mess.

There's a point, I think, where you've got to accept that most people who use
plain-text mail are doing so because they *want* plain-text mail. If they want
fancy formatting, they'll use HTML mail. So don't try to force too much fancy
formatting on their plain-text messages. Linkifying: fine.
Emboldening/italicizing: ok. Colorizing: perhaps. Inserting, deleting, or changing
characters: uh uh. Going too far.
> (a) Mozilla is open source
Ah. Really? :-) (Note: This was just a joke.)

Posting to alt.ascii-art leads to a biased result, because groups with other
interests are not appropriately represented. We can't get a vote from the whole
of Usenet before each feature. If I had known what discussion this feature would
lead to, I would not have started to implement it. And I don't know if that is in
the best interest of our users.

> 4. Because various clients use different symbols for quoting -- some use ">",
> some "> ", others use ": " or "| "

The algorithm in Netscape Messenger 4.x works quite well.

> There's a point, I think, where you've got to accept that most people who use
> plain-text mail are doing so because they *want* plain-text mail.

I disagree. I'm almost certain most users use plain text mail only for
compatibility reasons. If not, the web would be plain text plus links.
2. I don't much like the graphic smileys, but I'm sure someone will, and maybe
I'd get used to them.  I do think I would like the ability to have (R) turned
into a real trademark symbol (but mozilla doesn't do trademark symbols on Unix,
sadly, so until that bug is fixed, if we did that substitution we'd see nothing
at all instead of the (R)).

3. If you see it in the mail window, then that's what the message actually
contains as far as mozilla is concerned, and if you reply to it, that's what
you'll be replying to and you have to trust to the output system to convert it
back in a reasonable way.  The smiley glyph is a good point -- there's no code
there now to detect it and turn it back into a smiley.  Maybe we need to add
that.
4. Yes, plaintext quotes of recognized formats will be turned into blockquote
cites.  Currently, the only "recognized format" is a leading >.  This shouldn't
be a problem for people who use other quote characters (e.g. leading |) -- we'll
just keep those as plaintext quotes just as in the no-substitution case.  4.x
didn't understand quote characters other than "> " either (e.g. it didn't change
them to the user-defined quote color and font), but there didn't seem to be many
complaints about that.
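
To make the "leading >" rule concrete, here is a minimal sketch in Python (not the Mozilla code; the function name quote_depth is hypothetical). It assumes a quote marker is a ">" at the start of the line, optionally followed by one space, and that any other prefix such as "|" leaves the line as plain text. The last example shows the false positive raised elsewhere in this bug.

```python
def quote_depth(line: str) -> int:
    """Count leading '>' markers, allowing one optional space after each.

    Assumption: only '>' counts as a quote character; '|' or ': '
    prefixes are left alone and yield a depth of 0.
    """
    depth = 0
    i = 0
    while i < len(line) and line[i] == ">":
        depth += 1
        i += 1
        # Skip the single space that many clients put after '>'.
        if i < len(line) and line[i] == " ":
            i += 1
    return depth


print(quote_depth("> > quoted twice"))  # nesting level 2
print(quote_depth("| other quote style"))  # stays plain text, level 0
# A decorative banner is indistinguishable from a nested quote:
print(quote_depth("> > > IMPORTANT ANNOUNCEMENT!!! < < <"))
```

A depth of 0 means the line is shown unchanged, which matches the no-substitution behaviour described above for unrecognized quote characters.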

Re why people use ascii mode: Put me down as someone who uses plaintext for
compatibility reasons.  If we really do get to the point where we have reliable
substitution in both directions, I might embrace the new semi-ascii mode.  But
it seems fairly clear that we need to keep a "complete ascii" mode for people
who prefer that mode, in which no substitution at all is done, and one can rely
on ascii art, tables, etc. coming through unchanged.  In fact, we should have an
easy way of switching modes (something in the View menu, probably), so that if I
normally use substitution mode but someone sends me something that obviously has
ascii art in it and it's not displaying correctly, I can toggle a switch and see
the original message untouched.
Ugh.  We were just talking about smiley substitution on IRC, and I realized that
this will totally break an idiom I use a lot:

Some text (with a little joke :-) and some more text

In other words, I use the ) in a smiley to close a parenthetical expression as
well as to be the mouth of the smiley, because I don't like having two
close-parens next to each other.  Now people using mozilla will see all my
parens as being unbalanced. :-(  Parsing for paren balancing to see if a smiley
is being used this way sounds nontrivial, though.
Yes, I use the bracket-smiley combo too (like this:-). Which is one of the
reasons I suggest a smiley be colorized, rather than converted to a graphic.

The bottom line is that I have little problem with various styles being applied
to certain strings, but I draw the line at actually changing the text. And I
would apply the same principle to blockquote citing, because it'll do the wrong
thing to this (for example):
> > > IMPORTANT ANNOUNCEMENT!!! < < <

Anyway, having a toggle item in the View menu for `Smart Styling' sounds like a
good idea, as long as its value is persistent (it stays the same between messages
and between sessions).
Akk,
3. I don't think, I like the way we create HTML replies. Preparing display and
creating content are different tasks.
We *have* to convert the smiley substitution back: I don't think Outlook Express
will be able to use the "chrome://" URLs - data loss bug. (R) is a similar
problem. If we currently don't display &reg;, how can we trust that all
recipients do?
Structured phrases are not that bad, since I don't remove content anymore, but
if we misstyle quotes, others could laugh at us and our users; confusion and
flames are possible reactions, too.

4. I have no problem with "> > > foo < < <" being interpreted as quote: this is
because the sender ignored widely used internet rules and no real content is
lost.

> Now people using mozilla will see all my parens as being unbalanced. :-(

They *are* unbalanced.

Matthew Thomas,
I think, what you want is a pref. The idea behind the toggle-menuitem/-icon is
exactly the per-message basis.
When replying, Mozilla shouldn't be converting the converted text->HTML back to
some semblance of the original text; it should be using the exact text from the
message source. Wouldn't that avoid a whole lot of hassle with data loss or
corruption? The text->HTML conversion should be used only for display purposes,
IMO.

4. I don't think there's such a thing as a `widely used Internet rule' for
quoting. Sure, make >ed text smaller, italic, green, or whatever. But 3 + 2
> 4 ... which is why you should leave the > symbol there, because it *might* be
being used for something other than quoting, as I just showed in that equation.
Just like the *asterisks* or the /slashes/ *might* be being used for something
other than emphasis.

And yes, Ben, I do want a pref. And I do want it in the View menu, not hidden
away in the prefs dialog, for the same reason rot13 belongs in the View menu and
not in the prefs dialog -- because it's something you generally need instant
access to.
Akk, "(bla :))"-problem:
there was a discussion on alt.ascii-art, thread "Smiley-face query"
<URL:http://www.deja.com/viewthread.xp?search=thread&recnum=%3c68j7dn$ccq@tron.sci.fi%3e%231/1>
about this.
One quote: "In any case, it'd say we can all agree on the following points:
[...] Emoticons cannot close a parenthesis."
(<URL:http://x21.deja.com/getdoc.xp?AN=312573064>).
I've changed the smiley detection code (in my tree) to avoid problems with ":-))"
etc.
It searches for the following pattern: SPACE - Smiley [- [.|,|;]] - WHITESPACE.
"WHITESPACE" means nsString::IsSpace returns true; "[- [.|,|;]]" means that
optionally either ".", "," or ";" may appear after the smiley (and stay in the
msg). Everything else is ignored.
Any objections?
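
For illustration only, the pattern described above could be sketched as a regular expression in Python. This is not the code in the tree: the SMILEYS set and the name find_smileys are hypothetical, and Python's \s stands in for nsString::IsSpace.

```python
import re

# Hypothetical set of smileys; the real list is not given in this bug.
SMILEYS = (":-)", ":-(", ";-)")

# SPACE - Smiley [- [.|,|;]] - WHITESPACE
# The lookbehind requires a space before the smiley; the lookahead allows
# one optional '.', ',' or ';' and then requires whitespace. Neither is
# part of the match, so the punctuation stays in the message.
_PATTERN = re.compile(
    r"(?<= )(" + "|".join(re.escape(s) for s in SMILEYS) + r")(?=[.,;]?\s)"
)


def find_smileys(text: str) -> list[tuple[int, str]]:
    """Return (offset, smiley) pairs for every smiley matching the pattern."""
    return [(m.start(1), m.group(1)) for m in _PATTERN.finditer(text)]


print(find_smileys("That was a joke :-) really"))
print(find_smileys("Doubled mouth :-)) is ignored"))
print(find_smileys("No space before (like this:-). is ignored too"))
```

Under these assumptions " :-)) " is skipped because the character after the smiley is neither whitespace nor one of the three allowed punctuation marks. A smiley that closes a parenthesis, as in "(with a little joke :-) and", still matches.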
Depends on: 18718
That last comment was a bit ambiguous. My most recent changes avoid problems with
*smileys* like " :-)) ", " :-(( " etc., not the "Bla (bla :-) bla." (instead of
"Bla (bla :-)) bla.") problem.

I created bug #18718 and a dependency for the "graphical smiley etc. in reply"
problem.
Thanks to Daniel Bratell for pointing me to the (original) post.

>> Warren Harris wrote:
>>> We probably need an extensible way (based on protocols modules) to
>>> recognize strings as URLs. I don't know
>>> whether that needs to be a special method, or whether your code can just
>>> look for "<alphanumeric>*:" up to the
>>> next whitespace character, and then just try to construct a URL from it.
>>> If it succeeds, then highlight it, if not, don't.

Warren Harris wrote:
> Ben Bucksch wrote:
> > "<alphanumeric>*:<non-whitespace>*" triggers far too often.
> > Users already complain, that "file://" urls are turned into links.
> >
> > Do you have an idea, how to make this dynamic?
>
> Yes, I think you/they should look for the pattern suggested above, and then
> try calling NS_NewURI. This should only succeed if the protocol exists, and
> the string is a syntactically valid URL.
>
> I guess we should special-case file: since that's almost always a link to the
> sender's machine, and not the receiver's. Alternatively, we could call the
> nsIFileChannel::Exists method and only highlight the URL if it's there.

I like this idea. Since the main purpose of nsMimeURLUtils is this
recognition, I started rewriting the class.
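
As a rough sketch of the approach discussed above (find a "scheme:rest" candidate, try to build a URL from it, special-case file:), here is a Python version. It is an illustration under stated assumptions, not the nsMimeURLUtils rewrite: urllib.parse.urlsplit stands in for NS_NewURI, and the KNOWN_SCHEMES set is a hypothetical substitute for asking which protocol handlers exist.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for "the protocol exists"; the real check would
# ask the registered protocol handlers.
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "news"}

# Candidate: a scheme-like word, a colon, then text up to the next whitespace.
CANDIDATE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:\S+")


def find_links(text: str) -> list[str]:
    """Return the candidates that parse as URLs with a known scheme."""
    links = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        try:
            parts = urlsplit(candidate)
        except ValueError:
            continue  # not a syntactically valid URL
        scheme = parts.scheme.lower()
        if scheme == "file":
            continue  # almost always points at the sender's machine
        if scheme in KNOWN_SCHEMES and (parts.netloc or parts.path):
            links.append(candidate)
    return links


print(find_links("see http://www.mozilla.org for details"))
print(find_links("Note: open file:///tmp/x or re:foo"))
```

This sketch does not trim trailing punctuation from a candidate, so a URL at the end of a sentence would keep its final period.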
WOW: txt2html.pl <http://www.thehouse.org/txt2html/>
Target Milestone: M12 → M14
m14.  let me know if there are more changes ready for this
in the next couple of days and we can see about getting them into m12.

maybe this is even post beta1?
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Target Milestone: M14 → M12
chofmann,
it has been checked in together with bug #19251 recently.
M12 FIXED.
I think it's safe to mark this verified.
The code is there.
Any specific bugs we find will be filed separately.
asj@ipa.net has graciously offered to test this feature.  He has started writing 
tests at: http://www.mozilla.org/quality/mailnews/tests/mn-html-to-txt.txt
Status: RESOLVED → VERIFIED
QA Contact: lchiang → asj
Product: MailNews → Core
Product: Core → MailNews Core
My Bug 522893 was marked as a duplicate. In Bug 522893 it was recommended that I try out:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-1.9.1/

Which I just did. Upon install it "imported" my existing mail folders. The emails that had the problem in TBird 2 still have it in 3. I'll attach a screen shot from one of them. Doesn't appear fixed to me. Unless it somehow has to do with message storage format on disk and I need to test with freshly received messages?
Screen shot of old saved (TBird 2) mail message that still has truncated URL when viewed in Tbird 3.
This bug is FIXED. Bug 522893 is not a duplicate, I'll reopen the latter.