Bug 16507 - Improve Plain text -> HTML
Status: VERIFIED FIXED
Product: MailNews Core
Classification: Components
Component: MIME
Version: Trunk
Platform: All All
Importance: P3 enhancement
Target Milestone: M12
Assigned To: Ben Bucksch (:BenB)
QA Contact: Alan S. Jones
URL: http://www.bucksch.org/1/projects/moz...
Depends on: 454 18718
Reported: 1999-10-15 03:02 PDT by Ben Bucksch (:BenB)
Modified: 2009-10-21 09:30 PDT

Attachments
conversion *bold* -> strong, _italic_ -> em (10.12 KB, patch)
1999-10-19 11:21 PDT, Ben Bucksch (:BenB)
Rewritten using nsString, bugs fixed (12.27 KB, patch)
1999-10-21 11:43 PDT, Ben Bucksch (:BenB)
Fixes a leak in last patch. (3.87 KB, patch)
1999-10-22 06:30 PDT, Ben Bucksch (:BenB)
Some (more) glyph substitution and exponents (5.45 KB, patch)
1999-10-29 10:44 PDT, Ben Bucksch (:BenB)
damage done to ASCII art (1.93 KB, text/plain)
1999-11-07 00:17 PST, Matthew Thomas, usability weenie
[Preliminary] GlyphSubstitution, code, quote, class attributes (23.07 KB, patch)
1999-11-12 23:37 PST, Ben Bucksch (:BenB)
Still truncates URLs in old saved messages (9.88 KB, image/png)
2009-10-21 09:22 PDT, Doug Hockin

Description Ben Bucksch (:BenB) 1999-10-15 03:02:22 PDT
*bold* <-> <strong>
_italic_ <-> <em>
URL <URL:ftp://venera.isi.edu/in-notes/rfc1738.txt> <-> <a
href="ftp://venera.isi.edu/in-notes/rfc1738.txt">URL</a>
Comment 1 Jean-Francois Ducarroz 1999-10-15 09:24:59 PDT
I don't know if we will directly support those formatting commands during composition, but we should be able to
display them correctly. As for the URLs, if I am not mistaken, we already support them when displaying a message.
Comment 2 Ben Bucksch (:BenB) 1999-10-15 09:43:59 PDT
Yes, URLs are clickable (although the <URL: and > still appear).

Why not support during composition? (Assuming composition is done via HTML
editor and then converted to plain text.) We could stop data loss.

At the moment, I'm just "evaluating" (browsing through the code).
Comment 3 Phil Peterson 1999-10-15 12:15:59 PDT
Ben, are you talking about these improvements in the context of reading a plain
text message which has *foobar* and then generating <b>foobar</b> for display
purposes? If so, rhp@netscape.com is the right owner for the bug.

Or are you talking about doing something when composing a message? If so, I'm
not clear what you're suggesting.

In any case, I don't think we need to do any more with URLs than we do. We
recognize them just fine without the extra syntax.
Comment 4 Ben Bucksch (:BenB) 1999-10-15 12:47:59 PDT
Both. The first case, together with your mention of URLs, covers the plain
text -> HTML part.

The conversion after composition from HTML -> plain text loses, for example, URLs and
other formatting (URLs being the worst).

Let's <em>say</em> I <strong>composed</strong> this <a
href="http://www.mozilla.org">message</a>

and decide to send m/a. The plain text part looks like this:

Let's say I composed this message

but it should look like this:

Let's _say_ I *composed* this message <URL:http://www.mozilla.org>

I assigned it to myself because I wanted to see what I can do. If you want it to
be implemented soon, assign it to rhp.
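
The HTML -> plain text mapping proposed above can be sketched as follows. This is only an illustration in Python using the standard html.parser module, not the editor's actual output code; the class name StructuredTextExporter and the function html_to_plain are made up for this example.

```python
from html.parser import HTMLParser


class StructuredTextExporter(HTMLParser):
    """Hypothetical exporter: keeps emphasis and link targets as plain text."""

    # Mapping taken from the bug description: <em> -> _..._, <strong> -> *...*
    MARKERS = {"em": "_", "strong": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.hrefs = []   # stack of href values for open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])
        elif tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                # Append the target after the link text, RFC 1738 style.
                self.parts.append(f" <URL:{href}>")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plain(markup):
    exporter = StructuredTextExporter()
    exporter.feed(markup)
    exporter.close()  # flush any buffered text
    return "".join(exporter.parts)
```

Fed the composed message from the example above, this sketch produces the "should look like this" line, including the <URL:...> suffix.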
Comment 5 Karl Ove Hufthammer 1999-10-15 12:54:59 PDT
Perhaps also foo^h^h^hbar in plain text could be displayed as
<strike>foo</strike>bar?
Comment 6 Ben Bucksch (:BenB) 1999-10-19 11:21:59 PDT
Created attachment 2271
conversion *bold* -> strong, _italic_ -> em
Comment 7 Ben Bucksch (:BenB) 1999-10-19 11:29:59 PDT
rhp,

could you please review my patch and check it in?

Please review very, very carefully. I'm sure there are still all kinds of bugs.
At least it's tested with egcs and seems to work on Linux. (I have no license for VC
:-(.)

The comments marked "Deleteme" are temporary notes for you.
Comment 8 Phil Peterson 1999-10-19 11:30:59 PDT
This all sounds like good stuff. I completely agree with Ben's 10/15 comments
that sending <a> links through plain text should include the URL.

I'm cc'ing rhp since he can probably suggest places where reading plain text and
generating HTML could be improved (as with the smiley face for :-) in mozilla)
and akkana since she can probably suggest places to improve outputting the
editor content model as plain text.

Ben, if I were you, I might split this bug up into several smaller ones, but
it's your call.
Comment 9 Mike Shaver (:shaver -- probably not reading bugmail closely) 1999-10-19 11:36:59 PDT
I always thought it would be _underline_, *bold*, /italic/.  Isn't that the
traditional way?

(I'd love to see |code| work, too.)
Comment 10 rhp (gone) 1999-10-19 11:43:59 PDT
Yes, the ability to do this type of formatting/text recognition is in the code
today and was enhanced from 4.x to 5.0. In 5.0, plain text URLs that are not
prefaced with the protocol (e.g. www.netscape.com) will be recognized as URLs.
Also, email addresses that are just typed as rhp@netscape.com will also be
link-a-fied.
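
A rough sketch of this kind of recognition, in Python rather than the C++ of libmime. The regular expressions and the name linkify are assumptions made for illustration; they are not the rules the real code uses.

```python
import html
import re

# Assumed patterns: a host starting with "www." or a simple user@host address.
TOKEN = re.compile(
    r"(?P<www>\bwww\.[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"|(?P<mail>\b[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def linkify(text):
    """Escape plain text for HTML and wrap recognized tokens in <a> elements."""
    out = []
    pos = 0
    for match in TOKEN.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        token = match.group(0)
        # Guess the scheme: http:// for www. hosts, mailto: for addresses.
        scheme = "http://" if match.group("www") else "mailto:"
        target = html.escape(scheme + token)
        out.append(f'<a href="{target}">{html.escape(token)}</a>')
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Note that a trailing full stop is left outside the link, because the patterns require at least one character after every dot.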

Something I should point out is that even plain text mail display is being done
with an HTML capable rendering engine, so you can tweak the text however you
want with HTML tags and the display will do the right thing.

While I was doing that, I played with having emoticons display as the image
they are supposed to represent. So :-) got replaced with a little smiley face.
Of course, "purists" told me I was ruining the Internet so that is why I put it
on a preference setting.

The code to do this is somewhat isolated, but when I get some spare time (ha,
ha, ha, ha...ok, I'm done) I wanted to make this thing truly extensible. I
would love to be able to have an interface that would let you do whatever you
wanted to do to the output before display.

The code of interest here is in the file:

   http://lxr.mozilla.org/mozilla/source/mailnews/mime/src/nsMimeURLUtils.cpp

Look at the function:

        nsresult        nsMimeURLUtils::ScanForURLs()

and you can see what is going on.

Enjoy!

- rhp
Comment 11 Akkana Peck 1999-10-19 11:44:59 PDT
Adding myself to cc list.  I'm puzzled why this is in libmime -- shouldn't it
live in the normal output methods, in nsHTMLToTXTSinkStream.?  Does libmime do
its own output conversion?

It would be pretty easy to add these conversions to nsHTMLToTXTSinkStream.
Comment 12 Phil Peterson 1999-10-19 11:46:59 PDT
Akkana, I think the reading side is in libmime, but the writing side uses your
output stuff.
Comment 13 rhp (gone) 1999-10-19 11:47:59 PDT
The rendering side of this lives in libmime. We also do some basic link-a-fying
in the compose back end for when you type http://www.netscape.com into an HTML
compose window, but don't actually create the link.

- rhp
Comment 14 Ben Bucksch (:BenB) 1999-10-19 12:16:59 PDT
Does anybody know an RFC that can confirm or refute Mike's comment?

|code| would be very easy to implement. But I need more info about usage; I've
never seen that. Is this used for vars or code fragments? Are they aligned in
blocks? I built in many safety checks, so none of the following would be
converted:
|code;| |<code>|
|code
code|
But I could change this if someone can tell me more details or, even better,
point me to a spec.
Comment 15 Ben Bucksch (:BenB) 1999-10-19 12:53:59 PDT
Obviously, we need to split the HTML -> plain text conversion off. Created bug
#16800.

Summary changed. Not sure what Component "Networking-Mail" means, so I chose MIME.
Comment 16 Ben Bucksch (:BenB) 1999-10-19 13:00:59 PDT
It's just a cosmetic change, but it would be nice if <URL:...> were
converted to just <a href="...">...</a>, not &lt;URL:<a href="...">...</a>&gt;.
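
A minimal sketch of that cosmetic change, in Python. The pattern URL_TAG and the function convert_url_tags are hypothetical names; the sketch assumes the URL contains no whitespace or angle brackets.

```python
import html
import re

# Assumed form of the wrapper: "<URL:" + URL without spaces or angle brackets + ">"
URL_TAG = re.compile(r"<URL:([^<>\s]+)>")


def convert_url_tags(text):
    """Replace <URL:...> with a bare anchor and escape everything else."""
    out = []
    pos = 0
    for match in URL_TAG.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        url = html.escape(match.group(1))
        # The "<URL:" and ">" wrapper is dropped; only the link remains.
        out.append(f'<a href="{url}">{url}</a>')
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Text outside a recognized wrapper is still escaped, so a stray "<" in the message cannot turn into markup.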
Comment 17 lchiang 1999-10-19 14:04:59 PDT
Eventually, this will have to be tested.  Is there a spec or something that we
can follow and write testcases for so that we get good coverage on this feature?
Comment 18 Ben Bucksch (:BenB) 1999-10-19 15:35:59 PDT
lchiang,

"Spec" is in the source as comment :-):

*Bold* -> <strong>
DELIMITER: not alphanumeric and not "*"
We're searching for the following pattern:
DELIMITER - "*" - ALPHA -
[ some text (maybe more "*"-pairs) - ALPHA ] "*" - DELIMITER.
<strong> is only inserted if the existence of a pair could be verified.
Same for _italic_ -> <em>

This is generally used to *stress* some word or *some phrase*; both cases should
be covered. Many others, like * bold *, are excluded on purpose, so that the
conversion is not triggered by "5 * 3 * 4 = 60" (safety first).

Providing test cases would make QA somewhat useless. (My own test cases work, of
course.)
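
For writing test cases, the spec above can be sketched as a small scanner. This is a Python illustration under stated assumptions, not the patch itself (which is C++): the start and end of the text are treated as delimiters, ALPHA is taken to mean a letter, and the names is_delimiter, find_close and convert are invented here.

```python
import html

TAGS = {"*": "strong", "_": "em"}


def char_at(text, i):
    return text[i] if 0 <= i < len(text) else None


def is_delimiter(ch, marker):
    # DELIMITER: not alphanumeric and not the marker itself.
    # Assumption: start/end of text (None) also counts as a delimiter.
    return ch is None or (not ch.isalnum() and ch != marker)


def find_close(text, start, marker):
    """Index of a closing marker for the opening marker at start, or -1."""
    j = text.find(marker, start + 2)
    while j != -1:
        # ALPHA before the closing marker, DELIMITER after it.
        if text[j - 1].isalpha() and is_delimiter(char_at(text, j + 1), marker):
            return j
        j = text.find(marker, j + 1)
    return -1


def convert(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = char_at(text, i + 1)
        if (ch in TAGS and is_delimiter(char_at(text, i - 1), ch)
                and nxt is not None and nxt.isalpha()):
            close = find_close(text, i, ch)
            if close != -1:  # only convert when the pair is verified
                tag = TAGS[ch]
                out.append(f"<{tag}>{html.escape(text[i + 1:close])}</{tag}>")
                i = close + 1
                continue
        out.append(html.escape(ch))
        i += 1
    return "".join(out)
```

With this sketch, "*stress*" and "*some phrase*" are converted, while "* bold *" and "5 * 3 * 4 = 60" are left alone.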
Comment 19 Phil Peterson 1999-10-19 15:38:59 PDT
Not sure what HTML you'd generate for |code|. Since we're reading a text/plain
message, it would already be rendered in a monospace font. Not much point
in wrapping it in a <pre>. shaver, did you have something in mind?

Lisa, I think this is testable in pretty much the same way as colorizing quoted
material in plain text messages:
1. Send yourself a plain text message with *foo* /bar/ _baz_
2. Read the message, and note that foo is bold, bar is italicized, and baz is
underlined.

Further capabilities (like what we do with |code| or <URL:xxx>) are TBD, I think.
Comment 20 Ben Bucksch (:BenB) 1999-10-19 15:39:59 PDT
lchiang,

I just saw that *you* wanted to write testcases. Sorry, I misunderstood you.

BTW: "[ something ]" means "something" is optional.
Is the Spec clear enough?
Comment 21 Ben Bucksch (:BenB) 1999-10-19 15:45:59 PDT
Phil,

<code> comes to mind :-):
<URL:http://www.w3.org/TR/REC-html40/struct/text.html#h-9.2.1>.

Plaintext is not necessarily rendered as monospaced (at least in 4.x). I'm
reading in a proportional font (screwing up tables and ASCII-art :-( ).

But even if display is monospaced, <code> should be rendered differently to
distinguish it from prose text.
Comment 22 lchiang 1999-10-19 15:50:59 PDT
(Thanks - I will review all this next week)
Comment 23 Ben Bucksch (:BenB) 1999-10-19 17:16:59 PDT
See <URL:news://news.mozilla.org/380D04D9.4D86E941@bucksch.org> ("ASCII-art
detection" under "Assuming "plain text" or "html mail".." at n.p.m.mail-news
from 19 Oct 99 23:55:05 GMT) for ASCII-art proposal.
Comment 24 John Moreno 1999-10-20 08:57:59 PDT
I agree with Shaver, /italic/ and _underline_ are frequently done this way.



I know of an amiga newsreader that does this, should I find out what all it does?
Comment 25 Ben Bucksch (:BenB) 1999-10-20 09:51:59 PDT
Planb,
see shaver's comment on bug 16800. But a RFC or at least an Internet-Draft would
be really helpful, I couldn't find any mention.
Comment 26 Ben Bucksch (:BenB) 1999-10-21 11:43:59 PDT
Created attachment 2328
Rewritten using nsString, bugs fixed
Comment 27 Ben Bucksch (:BenB) 1999-10-21 11:51:59 PDT
Usual warnings apply, this time especially regarding the passing of nsString
between functions (leaks etc.). Again: I'm unable to take any responsibility for
the code :-(.

/italic/ works now.
_underline_ is transformed to <em>, since <u> is deprecated, I would have to use
stylesheets. Any ideas?
|code| is commented out, because it is invisible in monospaced viewers and I
remove the "|". It also works the same as *bold*, need more info (see my notes
above).
Comment 28 Phil Peterson 1999-10-21 14:24:59 PDT
RichP, would you code review this please?
Comment 29 Ben Bucksch (:BenB) 1999-10-22 06:30:59 PDT
Created attachment 2354
Fixes a leak in last patch.
Comment 30 rhp (gone) 1999-10-22 20:11:59 PDT
Sorry for the delay in this review. Looks good to me!

- rhp
Comment 31 Ben Bucksch (:BenB) 1999-10-22 23:50:59 PDT
Cool. Marked M11.

Need suggestions for ascii-art detection, see
news://news.mozilla.org/380D04D9.4D86E941@bucksch.org or
http://www.deja.com/msgid.xp?MID=<380D04D9.4D86E941@bucksch.org>
and its reply. BTW: I just noticed, dejanews uses "<" and ">" in URLs. Nice.

|StructPhraseHit(nsCAutoString text, PRBool col0, ...|
should better be
|StructPhraseHit(const nsCAutoString text, PRBool col0, ...|
Comment 32 Ben Bucksch (:BenB) 1999-10-26 09:58:59 PDT
Assigning to rhp, so he can check it in.
Comment 33 rhp (gone) 1999-10-27 10:12:59 PDT
Hi Ben,
I am really suspicious of the Right() call. I don't understand why you don't get
garbage on return. If you do get a valid string returned, then the
nsCAutoString is returning an allocated string, which means we are leaking. The
problem is there are tons of string classes so I am unsure of the exact
behavior.

I would probably return a newly allocated string and free it on the calling
side. This may be what nsCAutoString is doing (without the free), but I'm not
sure. I know that nsString.ToNewCString() will do this, and then you have to
free the memory.

- rhp
Comment 34 Ben Bucksch (:BenB) 1999-10-27 12:34:59 PDT
Rich,

I did some research and everything is like I hoped it would be :-). I love C++.

- return copies
Objects are returned by invoking the copy constructor (1997 C++ Public Review
Document, Section [class.copy],
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.copy>).
This is the reason, why I "don't get garbage".
[stmt.return]
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/stmt.html#stmt.return>
- Destruction on out of scope
If an (automatic) object falls out of scope, the destructor is called.
[class.dtor], paragraph 10, case 2
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.dtor>
- Example
An example of my usage of objects is at [class.temporary], Paragraph 2
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.dtor>

- |ns*AutoString|s free the memory at destruction.
"The point of nsAutoStrings is [...] to auto-destroy the string when it goes out
of scope." (<URL:http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsStr.h#132>)
If my understanding of |ns*String| is correct, all |ns*String|s free the memory
at destruction, if they own it (see
<URL:http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsString.cpp#137> and
<URL:http://lxr.mozilla.org/seamonkey/source/xpcom/ds/nsStr.h#239>).

Reassigning to me, since checkin is done.
Comment 35 Ben Bucksch (:BenB) 1999-10-27 13:18:59 PDT
Sorry, the link for the example is wrong. The correct one is:
<URL:http://www.maths.warwick.ac.uk/cpp/pub/wp/html/cd2/special.html#class.temporary>
Comment 36 Ben Bucksch (:BenB) 1999-10-28 07:35:59 PDT
huftis,
I didn't forget your question^H^H^H^H^H^H^Hproposal, but it will be hard to
implement, because the code walks char by char through the msg. The other plain
text tags enclose the phrase like HTML tags do, so I could just substitute. I
think, /I/ won't implement that^H^H^H^Hyour proposal.
Comment 37 Karl Ove Hufthammer 1999-10-28 10:45:59 PDT
> huftis,
> I didn't forget your question^H^H^H^H^H^H^Hproposal,
> but it will be hard to implement. I think, /I/ won't
> implement that^H^H^H^Hyour proposal.

OK, but how about character substitution. Example:

=>          -->   U+21D2
--> or ->   -->   U+2192

And perhaps even:
^2     -->   U+00B2
1/2    -->   U+00BD
(C)    -->   U+00A9
Comment 38 Ben Bucksch (:BenB) 1999-10-29 10:44:59 PDT
Created attachment 2482
Some (more) glyph substitution and exponents
Comment 39 Ben Bucksch (:BenB) 1999-10-29 11:00:59 PDT
huftis,
the following strings are not substituted:
|TXT   |HTML     |Reason
+------+---------+----------
 ->     &rarr;    Char not displayed on Linux (not even a placeholder)*
 =>     &rArr;    ditto
 <-     &larr;    ditto
 <=     &lArr;    ditto
 (tm)   &trade;   ditto
 1/4    &frac14;  is triggered by 1/4 Part 1, 2/4 Part 2, ...
 3/4    &frac34;  ditto
 1/2    &frac12;  similar
 !=     &ne;      used in C/C++(-pseudo)-code
 <=     &le;      ditto
 ...    ...       ditto
+------+---------+------------
*I'd like to know why.
I'm substituting "(c)", "(r)" and "+/-" (using rhp's glyph substitution code),
but I'm not even sure, if the signs for these display correctly on all platforms
(tested only on Linux).
You might be interested in
<URL:http://www.w3.org/TR/REC-html40/sgml/entities.html>.

rhp,
could you please review that and check it in? Tnx.

QA,
Test this (all my patches) with wild and unusual test cases. Every substitution,
where it shouldn't be, is a bug. File it against me.
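
For QA, the three substitutions named above can be sketched like this. It is a Python illustration only; the real code reuses rhp's glyph substitution in libmime. Only the lowercase spellings quoted above are handled, since case handling is not described here, and the name substitute_glyphs is made up.

```python
import html

# Only the substitutions listed as enabled above; arrows, fractions, (tm), != etc.
# are deliberately left out.
GLYPHS = (
    ("(c)", "&copy;"),
    ("(r)", "&reg;"),
    ("+/-", "&plusmn;"),
)


def substitute_glyphs(text):
    """Escape the text for HTML, then swap in the entity references."""
    # Escaping does not touch any character used in the patterns,
    # so replacing afterwards is safe.
    escaped = html.escape(text, quote=False)
    for plain, entity in GLYPHS:
        escaped = escaped.replace(plain, entity)
    return escaped
```

A string such as "1/2 -> x" passes through unchanged apart from HTML escaping, which matches the "not substituted" rows of the table.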
Comment 40 Ben Bucksch (:BenB) 1999-10-29 11:04:59 PDT
Oh, I forgot the best :-): Exponents are <sup>'ed.

Changing Summary.
Comment 41 Karl Ove Hufthammer 1999-10-29 11:42:59 PDT
Wouldn't it be better if ^5 were substituted with &#x2075; (U+2075)? If the font
didn't contain that glyph, it could *then* be converted to <SUP>5</SUP> (see
BUG #12662 <URL:http://bugzilla.mozilla.org/show_bug.cgi?id=12662>).
Comment 42 Akkana Peck 1999-10-29 11:46:59 PDT
I've filed a couple of bugs on some of these entities which aren't displayed in
Linux (&bull; is another one).  Bug 454 seems to be the main bug concerning
these (currently marked as TRIVIAL so I wonder if we're going to be stuck with
this bug forever); 5383 concerned &trade; but was duped to 454; 16872 is another
one on &bull; specifically (which might be a different issue since it can be done
in gfx instead of requiring a font that has those characters).
Comment 43 Matthew Tuck [:CodeMachine] 1999-10-31 22:57:59 PST
If _a_ means underline, you should use underline.  If you don't want to use <u>,
then use <span> and CSS.

It's certainly better to have HTML be document markup rather than
presentational, but these ARE distinctly presentational, and I think they should
be rendered presentationally.  The same for italic - making it em seems wrong to
me.
Comment 44 Ben Bucksch (:BenB) 1999-11-01 04:47:59 PST
Matty,
what is the plain text equivalent to <em>? I usually don't want to underline
something, I want to stress something in different levels. I think, <u> is
deprecated for a good reason.
Comment 45 Akkana Peck 1999-11-01 10:36:59 PST
I would suggest that *emphasized text* is used more to indicate emphasis (i.e.
<em>) than to indicate bold.  I would expect *starred text to be in italic* (or
whatever is being used for <em>) and ALL_CAPS text to be in BOLD.

I agree that _underlined_text_ should map to <u> since that ascii construct is
very specific (and awkward to type, so no one would use it unless they really do
mean underline).
Comment 46 Ben Bucksch (:BenB) 1999-11-01 11:48:59 PST
*sigh*

The only thing on which we all agreed until now was that *stars* mean bold.
I use _this_ usually to emphasize something (but not as much as *bold*), having
italic in mind. So I _do_ use it and do not mean underline. And I think, others
do, too.

We can't use <u>, because it's deprecated; we would have to add a stylesheet.

My personal opinion is: I want presentational layout to die. And underline is an
ugly looking leftover from the times of typesetting machines, where there were
no other methods to stress something.
Comment 47 Ben Bucksch (:BenB) 1999-11-01 11:52:59 PST
What if I add a pref "mail.do_struct_phrase_presentational" defaulting to
FALSE, that maps *bold* to <b>, /italic/ to <i> and _underline_ to underlining
based on stylesheets?

(We already have an "mail.do_struct_phrase" (and "mail.do_glyph_substitution"
BTW). Maybe we could consolidate them and change the names.)
Comment 48 Matthew Tuck [:CodeMachine] 1999-11-01 13:37:59 PST
Ben, you shouldn't use <B> any more than <U>.  It may not be deprecated, but
it's still presentational, and hopefully it will be deprecated in future.  I
think you can use a span and have a style attribute that will allow you to do
all with CSS.

Regarding translating em and strong to plain text, I'm not really sure, but it
would be really nice to use a stylesheet.  That sounds quite complicated
though.  Maybe just use bold.

I guess it seems they aren't used consistently, in which case your original
mappings make sense.  This way at least the user can edit their plaintext
stylesheet to reflect how they want messages displayed.
Comment 49 Ben Bucksch (:BenB) 1999-11-01 14:15:59 PST
Matty,
a stylesheet for all formatting sounds like a good idea to me. Unfortunately, I
never really worked with stylesheets. May take some time till code follows.

HTML -> plain text is offtopic here now, see bug #16800. I don't know, how you
want to use a stylesheet for HTML -> plain text conversion, but this is too late
anyway, I think (unless this is a really good idea), because I already have
working code. I just need to make last checks.
Comment 50 Matthew Thomas, usability weenie 1999-11-01 14:37:59 PST
My two cents, as maintainer of the news:alt.ascii-art FAQ
(http://cantua.canterbury.ac.nz/~mpt26/art/ascii/faq/):

(1) ASCII artists (and others who use ASCII art sigs, etc) are going to be
*extremely* annoyed if the formatting-detection algorithm guesses wrong. And it
*will* guess wrong occasionally, no matter how good it is. (What happens, for
example, if I insert a /*C comment*/ ...) ASCII artists like Mozilla (in its
current incarnation as Netscape Messenger) because, unlike MS Outlook, it leaves
ASCII art alone. And they'd rather it stayed that way.

As a compromise, I would suggest doing it the way (IIRC) XEmacs' mail reader does
it. That is, apply the formatting, but *leave the special characters there*
(perhaps dim them, but leave them there). That way things won't get too
mangled if the algorithm guesses wrong.

(2) When I use /slashes/, sometimes I mean <em>, and sometimes I mean <cite>. You
can't know which I mean, because text isn't structural markup. So I see no option
except to use CSS italics. Similarly for *asterisks*, you can't know whether I
mean <strong> or <vector-space> or whatever, so CSS bold is really the only
option.

(3) If URLs without protocols are going to be detected as http addresses, as
rhp@netscape.com suggests, surely the wrong thing is going to happen for these:
* you can download this by anonymous FTP at foo.bar.net
* to begin, telnet to library.canterbury.ac.nz and log in as "guest"
* you are cordially invited to mozilla.party three.oh

And please don't tell me you're just going to do it for domains starting with
`www.', or I'll scream.

-- mpt (http://critique.net.nz/ -- not a www. in sight)
Comment 51 Mike Shaver (:shaver -- probably not reading bugmail closely) 1999-11-01 14:46:59 PST
What he said.  I regularly send stuff like this:

  ``we should just prune entries matching /(mozilla\.org|netscape\.com)$/''

  ``go into your srcdir and rm *TitledButton*, then update your tree''

and I mean neither italics nor emphasis.

DWIM is very hard to get right.
Comment 52 Ben Bucksch (:BenB) 1999-11-01 17:07:59 PST
mpt and Mike,

/(mozilla\.org|netscape\.com)$/'' would not be changed as described to lchiang.
But rm *TitledButton* and /*This comment*/ would. I'll let the plaintext tags in
the text.
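
These cases can be checked against the rule described to lchiang with a small sketch. It is a Python approximation under assumptions: ASCII letters only, no phrase spanning a line break, and no nested marker pairs. The names phrase_pattern and would_convert are invented here.

```python
import re


def phrase_pattern(marker):
    """DELIMITER marker ALPHA [ text ALPHA ] marker DELIMITER, as a regex."""
    m = re.escape(marker)
    return re.compile(
        rf"(?<![A-Za-z0-9{m}])"            # DELIMITER (or start of text) before
        rf"{m}[A-Za-z]"                    # opening marker followed by ALPHA
        rf"(?:[^{m}\n]*[A-Za-z])?"         # optional text ending in ALPHA
        rf"{m}(?![A-Za-z0-9{m}])"          # closing marker, DELIMITER (or end) after
    )


def would_convert(text, marker):
    return phrase_pattern(marker).search(text) is not None


# The regex is not touched: "(" after the first slash is not ALPHA.
print(would_convert(r"/(mozilla\.org|netscape\.com)$/", "/"))
# Both of these contain a verified "*" pair, so they would be marked up.
print(would_convert("rm *TitledButton*", "*"))
print(would_convert("/*This comment*/", "*"))
```

The last two cases are exactly the false positives that argue for leaving the plain text tags in the output.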

mpt,

what is bad with em tags (maybe with a type attribute), if you have a
stylesheet?

I think, disabled people wouldn't like CSS bold. (Would they like us to leave
plain text tags in? Maybe one more pref? :-( )
Comment 53 Matthew Tuck [:CodeMachine] 1999-11-01 18:23:59 PST
Agreed that the characters should not disappear.  Disability issues should go
away then, and there's always the chance to apply a stylesheet to a plaintext
message.

What I was referring to about stylesheets on plaintext messages applies both to
this, and it would also be nice to turn quoting into using <BLOCKQUOTE> (do we
do this already?)

Sorry about being offtopic, I got confused for a minute.
Comment 54 Ben Bucksch (:BenB) 1999-11-01 19:00:59 PST
matty, sorry, I don't understand anything, what you're saying, but maybe I'm
just too tired.
Comment 55 Mike Shaver (:shaver -- probably not reading bugmail closely) 1999-11-01 21:07:59 PST
I don't see any explanation of why /[abcdef0123456789]/ wouldn't get ``fixed''
to <i>regex</i> above.  Can you elaborate?
Comment 56 Ben Bucksch (:BenB) 1999-11-02 05:51:59 PST
Mike,

the explanation in question is (to make sure, we're speaking about the same
thing) for *bold* -> <strong>:
DELIMITER: not alphanumeric and not "*"
We're searching for the following pattern:
DELIMITER - "*" - ALPHA -
[ some text (maybe more "*"-pairs) - ALPHA ] "*" - DELIMITER.
<strong> is only inserted, if existance of a pair could be verified

What do you mean by "/[abcdef0123456789]/"? Should *I* evaluate that or take
it as string? If this or "/(mozilla\.org|netscape\.com)$/" appears exactly that
way (or without the quotes) in the msg, my code would leave it. But if you mean,
if "/abc[abcdef0123456789]def/" (with or without the quotes) would be changed,
the answer is yes. (But neither "/abc[abcdef0123456789]789/" nor
"/abc[abcdef0123456789]/" would be changed.)
Comment 57 Ben Bucksch (:BenB) 1999-11-02 06:00:59 PST
It's not yet clear what "change" means. At the moment, "/" is substituted with
"<em>". But as you pointed out, it would do the wrong thing for "rm *diff*". The
only solution I see is, as mpt suggested, to leave the plain text tags in and dim
them.
Mike's idea of content-before in CSS to re-add the TXT tags sounds very nice, but
I'm not sure, what would happen, if we reply with HTML. The TXT tags *should*
still be there, even if the recipient uses a non-CSS-capable HTML viewer.
Comment 58 Ben Bucksch (:BenB) 1999-11-02 07:21:59 PST
mpt,
start to scream:
<URL:http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/nsMimeURLUtils.cpp#323>
But I don't know what would be so wrong with guessing that www.bucksch.org is really a
reference to http://www.bucksch.org.
Comment 59 Ben Bucksch (:BenB) 1999-11-02 15:15:59 PST
Matty, I still don't understand your comment.

I think, when we /remove/ the plaintext tags, this will help disabled persons (
how will "..we slash remove slash the.." sound?). The interface (whatever this
may be) can do, what it thinks is best with tags like em and strong, and they're
well-known. This is one of the reasons why I vote for <em type=txt_italic> or
similar and not any "font-style: italic".

I never heard of stylesheets for plain text. What is that?
Comment 60 Matthew Tuck [:CodeMachine] 1999-11-02 15:54:59 PST
A stylesheet for plaintext messages, as in, by the time you do Plain Text ->
HTML, it will be styleable.

Your comment about disability is true, but for all users, the potential loss of
characters is too great a chance no matter what scheme you adopt.
Comment 61 chris hofmann 1999-11-02 20:36:59 PST
any fix in hand for this? moving to m12.
move back if its ready to resolve in the next day or so.
Comment 62 Ben Bucksch (:BenB) 1999-11-04 13:12:59 PST
I'll try to get rid of the hardcoded quote formatting, too, and use the
stylesheet.
Comment 63 Matthew Thomas, usability weenie 1999-11-07 00:13:59 PST
*** mozilla@bucksh.org said,
> `what is bad with em tags (maybe with a type attribute), if you have a
> stylesheet?'.

What is bad with them is that:

- you'd convert "I saw /Gone with the wind/" to "I saw <em>Gone with the
  wind</em>", when it should be "I saw <cite>Gone with the wind</cite>"

- you'd convert "some random value of /x/" to "some random value of <em>x</em>",
  when it should be "some random value of <var>x</var>".

See? /slashing/ is presentational content. You can't tell which semantic thingy I
mean by it, so the only honourable thing to do is to go <span style="font-style:
italic"></span> (or whatever) instead. The same with *asterisking* as either
<strong> or <vectorspace> (or whatever the MathML tag for a vector space is). You
have to do <span style="font-weight: bold"></span>.

I'm an ardent defender of the Internet rights of disabled people, but you have to
give up the semantic content in this case, simply because you don't know which
semantic content was intended. This is plain text we're dealing with, remember,
so it's not as if we're making things any worse. (Are you going to try to turn >s
into <blockquote class="cite"></blockquote>, for example? That would be similarly
confusing for disabled people ...)

*** mozilla@bucksh.org wrote:
> mpt, start to scream:
> <URL:http://lxr.mozilla.org/seamonkey/source/mailnews/mime/src/nsMimeURLUtils.cpp#323>
>
> But I don't know, what so wrong with guessing www.bucksch.org would really be
> a reference to http://www.bucksch.org.

Aaaaaaaaaaaaargh! (Further screaming available on request.) Because it will lead
people to assume that if something doesn't start with `www.', it's not a Web
addresss. It's not going to highlight cnn.com, or slashdot.org, or
home.netscape.com, for example. That sucks lots.

*** I've been drawing some ASCII art (a mockup of a new Mozilla prefs dialog,
actually), and keeping it in my Drafts mail folder. I was not amused to access
the message using the latest nightly build, and find the following.

* {anything}@{anything} is assumed to be an e-mail address, even if {anything}
  has non-e-mail-characters such as `)' in it. What's up with that? What's wrong
  with "mailto:"?

* The smiley algorithm interfered with my picture (excerpt attached). Make the
  smiley faces go away. NOW.

Matthew `a smiley killed my father' Thomas
Comment 64 Matthew Thomas, usability weenie 1999-11-07 00:17:59 PST
Created attachment 2699
damage done to ASCII art
Comment 65 Ben Bucksch (:BenB) 1999-11-07 06:21:59 PST
Matthew Thomas,
thanks for your notes.

First: I'm not responsible for the URL and smiley things :-). Nevertheless, a
discussion on IRC brought up the same problem, and I fixed it. I also re-added
the plaintext tags to the generated msg (the stars in *bold* are included in
the msg content now). I didn't use content-before/after, in part, because it
might get lost in a HTML reply viewed in other MUAs. The code lies on my
machine, because the tree is closed.

At the moment, we (with my version) search for the following patterns: ":-)",
":-(", ";-)", ";-P", " :)" and " :(". (":(" occurs in C++ code.) They are still very
wide, because they have to catch e.g. "... bla :-).". Remember, you can always
disable it; we even have separate prefs for glyph substitution and structured
phrases.
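
The pattern list can be sketched as a simple scan. This is a Python illustration with an invented function name, find_smileys; what the real code puts in place of each hit is not shown.

```python
# The six patterns listed above. The two-character forms require a leading
# space, so that ":(" inside C++ code is not picked up.
SMILEYS = (":-)", ":-(", ";-)", ";-P", " :)", " :(")


def find_smileys(text):
    """Return (index, pattern) for every smiley found, scanning left to right."""
    hits = []
    i = 0
    while i < len(text):
        for smiley in SMILEYS:
            if text.startswith(smiley, i):
                hits.append((i, smiley))
                i += len(smiley) - 1  # skip the rest of the match
                break
        i += 1
    return hits
```

The patterns stay "wide" in the sense that nothing is required after the smiley, which is what lets "... bla :-)." match.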

Just look at your own post to see why we try to find email addresses w/o
"mailto:" (BTW: my domain is "bucksch.org", not "bucksh.org"). Can you point me
to the RFC and place where the valid characters of email addresses are defined?
"(" is at least valid in general URLs.

>Because it will lead people to assume that if something doesn't start with
>`www.', it's not a Web address.

Sorry, but this reasoning is broken, for everything. A <= B need not be true if
A => B.
There's no problem if it doesn't highlight cnn.com; that's not a valid URL. All we
do is guess, that www.cnn.com is one.

Structured phrases:
> - you'd convert "some random value of /x/" to "some random value of
> <em>x</em>", when it should be "some random value of <var>x</var>".

This is a misuse of the convention, because |code| is used for marking code
fragments. Nevertheless, I see nothing wrong in emphasizing "x".

> - you'd convert "I saw /Gone with the wind/" to "I saw <em>Gone with the
> wind</em>", when it should be "I saw <cite>Gone with the wind</cite>"

Ditto (you should have used quotation marks, but strong is not bad).

> This is plain text we're dealing with, remember, so it's not as if we're
> making things any worse.

I have to give that back. Speaking of structured phrases: I just *add* markup,
the content remains now. And I hope, reading humans will be able to correct the
1% of wrong markup we may cause, although I try to avoid wrong markup if
possible.

> Are you going to try to turn >s into <blockquote class="cite"></blockquote>

I don't understand that.
Comment 66 Ben Bucksch (:BenB) 1999-11-07 06:31:59 PST
Corrections:
> All we do is guess, that www.cnn.com is one.
All we do is guess, that www.cnn.com can be transformed into one and that it was
the intention of the author to do so.

> Nevertheless, I see nothing wrong in emphasizing "x".
Nevertheless, I see nothing wrong in emphasizing "/x/".
Comment 67 Ben Bucksch (:BenB) 1999-11-07 06:42:59 PST
I just noted another problem:
"<em class=txt_star>/italic/</em>" is doubled markup and conversion back to
plaintext (done as usual at Mozilla Mailnews) will result in "//italic//". (We
will stop here, because I don't convert "//italic//".) Of course, we can avoid
that in our own conversion, but not in the ones of other mailers (mail in
plaintext, reply via HTML, reply via plaintext).
Maybe a "<span class=txt_star>/italic/</span>" is better. But this will take the
ability from other mailers to pretty-up the text (while it doesn't force to do
so).
Comment 68 Matthew Thomas, usability weenie 1999-11-07 16:31:59 PST
1. So the style-triggering characters are being left behind when the styles are
applied. Good. One less thing for GNUS users to gloat about.

2. In the same vein, why not just colorize smileys instead of replacing them with
a graphic? For example in ":-)", make the ":" blue (eyes), the "-" brown (nose),
and the ")" dark red (mouth)? That would get the effect across, without
corrupting accidental smileys, in the same way as making *this* bold gets the
effect across without corrupting *accidental* strings.

2. Here's a test case for you: Senator John Smith (R) said today that he was not
amused at Mozilla thinking his name was a registered trademark ...

3.
> But this will take the ability from other mailers to pretty-up the text (while
> it doesn't force to do so).

Let's get this straight ... are we applying all this formatting to *outgoing*
messages, or just to the display (and not the replying-to or forwarding) of
*incoming* messages? I sincerely hope it's just the latter ... people won't be
happy if, Eudora-Mail-like, you're misrepresenting the contents of forwarded
messages.

4. What I meant by converting >s to <blockquote>s is converting, for example,
> > foo!
> bar!

to <blockquote class="cite"><blockquote class="cite">foo!</blockquote>bar!</blockquote>.
But it's a bad idea, so don't do it. :-)

5. Section 6.1 of http://www.faqs.org/rfcs/rfc822.html says that the local part
of an e-mail address must be
word *("." word)
and I'd be surprised if `word' included brackets (it's not defined in the RFC).
But that might be outside the scope of this bug ...?
Comment 69 Ben Bucksch (:BenB) 1999-11-07 22:00:59 PST
Matthew Thomas,
I just noticed, that you brought that discussion to alt.ascii. I'm not sure, if
I like the discussion to be that open.

2. (The first) That's rhp's decision, but I like the graphics smilies. If they
start to annoy me, I'll disable Glyph substitution.

2. (The second) Tomorrow, somebody will say ":-)" is a valid word in some
language :-). Dunno what to do with that. Anyone else?

(Two "2."s: That's the reason, why we have HTML mail :-).)

3. When I started implementing this, I assumed, I only change display. Later,
Akk told me, that we generate the quote in a HTML reply from the displayed msg.
This would include the smiley reference, which would break. Akk?

4. Can you explain, why that's a bad idea? Don't tell me, it's used in
ASCII-art.

5. "word" is defined in Section 3.3. ("(" may occur in the local part of an
email address.)
Comment 70 Matthew Thomas, usability weenie 1999-11-07 22:38:59 PST
0. Because (a) Mozilla is open source; (b) along with Forte (Free/non-free)
Agent, Mozilla is a popular choice for ASCII artists because it doesn't munge
ASCII art, *yet*; and (c) posting there was the best way I could think of to
solicit the opinions of the ASCII art community. I might be clever, but I can't
necessarily think of every impact this smoke-and-mirrors stuff will have on ASCII
art. The newsgroup as a whole has a better shot at being able to. Open source,
y'see.

3. Having the styles inserted only for display, not for replying/forwarding,
would solve the //italic// problem you described earlier, wouldn't it?

4. Because various clients use different symbols for quoting -- some use ">",
some "> ", others use ": " or "| ", some let the user select the symbol. You're
going to have a difficult job working out whether something's quoted or not, and
the net result will probably be making it look *more* of a mess.

There's a point, I think, where you've got to accept that most people who use
plain-text mail are doing so because they *want* plain-text mail. If they want
fancy formatting, they'll use HTML mail. So don't try to force too much fancy
formatting on their plain-text messages. Linkifying: fine. Emboldening/
italicizing: ok. Colorizing: perhaps. Inserting, deleting, or changing
characters: uh uh. Going too far.
Comment 71 Ben Bucksch (:BenB) 1999-11-07 23:22:59 PST
> (a) Mozilla is open source
Ah. Really? :-) (Note: This was just a joke.)

Posting to alt.ascii-art leads to a biased result, because groups with other
interests are not appropriately represented. We can't get a vote from the whole
Usenet before each feature. If I had known what discussion this feature would
lead to, I would not have started to implement it. And I don't know if that is
in the best interest of our users.

> 4. Because various clients use different symbols for quoting -- some use ">",
> some "> ", others use ": " or "| "

The algorithm in Netscape Messenger 4.x works quite well.

> There's a point, I think, where you've got to accept that most people who use
> plain-text mail are doing so because they *want* plain-text mail.

I disagree. I'm almost certain most users use plain text mail only for
compatibility reasons. If not, the web would be plain text plus links.
Comment 72 Akkana Peck 1999-11-08 10:59:59 PST
2. I don't much like the graphic smileys, but I'm sure someone will, and maybe
I'd get used to them.  I do think I would like the ability to have (R) turned
into a real trademark symbol (but mozilla doesn't do trademark symbols on Unix,
sadly, so until that bug is fixed, if we did that substitution we'd see nothing
at all instead of the (R)).

3. If you see it in the mail window, then that's what the message actually
contains as far as mozilla is concerned, and if you reply to it, that's what
you'll be replying to and you have to trust to the output system to convert it
back in a reasonable way.  The smiley glyph is a good point -- there's no code
there now to detect it and turn it back into a smiley.  Maybe we need to add
that.
4. Yes, plaintext quotes of recognized formats will be turned into blockquote
cites.  Currently, the only "recognized format" is a leading >.  This shouldn't
be a problem for people who use other quote characters (e.g. leading |) -- we'll
just keep those as plaintext quotes just as in the no-substitution case.  4.x
didn't understand quote characters other than "> " either (e.g. it didn't change
them to the user-defined quote color and font), but there didn't seem to be many
complaints about that.
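
A minimal sketch of the conversion described in item 4, written in Python for illustration rather than taken from the Mozilla code. It assumes the only recognized quote marker is a leading ">" (optionally followed by spaces) and leaves every other line, including "|"-quoted ones, as plain text. The function name quotes_to_blockquotes is hypothetical.

```python
import html


def quotes_to_blockquotes(text):
    """Turn leading '>' quote markers into nested <blockquote class="cite">."""
    out = []
    depth = 0  # number of blockquotes currently open
    for line in text.splitlines():
        # Count the leading '>' markers; spaces between markers are allowed.
        level = 0
        rest = line
        while rest.startswith(">"):
            level += 1
            rest = rest[1:].lstrip(" ")
        # Open or close blockquotes until the depth matches this line.
        while depth < level:
            out.append('<blockquote class="cite">')
            depth += 1
        while depth > level:
            out.append("</blockquote>")
            depth -= 1
        out.append(html.escape(rest))
    # Close whatever is still open at the end of the message.
    out.extend(["</blockquote>"] * depth)
    return "\n".join(out)
```

Note that this sketch drops the ">" characters and would also treat a line such as "> > > IMPORTANT ANNOUNCEMENT!!! < < <" as a three-level quote, which is exactly the kind of misfire discussed in this bug.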

Re why people use ascii mode: Put me down as someone who uses plaintext for
compatibility reasons.  If we really do get to the point where we have reliable
substitution in both directions, I might embrace the new semi-ascii mode.  But
it seems fairly clear that we need to keep a "complete ascii" mode for people
who prefer that mode, in which no substitution at all is done, and one can rely
on ascii art, tables, etc. coming through unchanged.  In fact, we should have an
easy way of switching modes (something in the View menu, probably), so that if I
normally use substitution mode but someone sends me something that obviously has
ascii art in it and it's not displaying correctly, I can toggle a switch and see
the original message untouched.
Comment 73 Akkana Peck 1999-11-08 11:11:59 PST
Ugh.  We were just talking about smiley substitution on IRC, and I realized that
this will totally break an idiom I use a lot:

Some text (with a little joke :-) and some more text

In other words, I use the ) in a smiley to close a parenthetical expression as
well as to be the mouth of the smiley, because I don't like having two
close-parens next to each other.  Now people using mozilla will see all my
parens as being unbalanced. :-(  Parsing for paren balancing to see if a smiley
is being used this way sounds nontrivial, though.
Comment 74 Matthew Thomas, usability weenie 1999-11-08 11:46:59 PST
Yes, I use the bracket-smiley combo too (like this:-). Which is one of the
reasons I suggest a smiley be colorized, rather than converted to a graphic.

The bottom line is that I have little problem with various styles being applied
to certain strings, but I draw the line at actually changing the text. And I
would apply the same principle to blockquote citing, because it'll do the wrong
thing to this (for example):
> > > IMPORTANT ANNOUNCEMENT!!! < < <

Anyway, having a toggle item in the View menu for `Smart Styling' sounds like a
good idea, as long as its value is persistent (it stays the same between messages
and between sessions).
Comment 75 Ben Bucksch (:BenB) 1999-11-08 13:43:59 PST
Akk,
3. I don't think, I like the way we create HTML replies. Preparing display and
creating content are different tasks.
We *have* to convert the smily substitution back: I don't think, Outlook Express
will be able to use the "chrome://" URLs - data loss bug. (R) is a similar
problem. If we currently don't display &reg;, how can we trust, that all
recipients do?
Structured phrases are not that bad, since I don't remove content anymore, but
if we misstyle quotes, others could smile about us and our users; confusion and
flames are possible reactions, too.

4. I have no problem with "> > > foo < < <" being interpreted as quote: this is
because the sender ignored widely used internet rules and no real content is
lost.

> Now people using mozilla will see all my parens as being unbalanced. :-(

They *are* unbalanced.

Matthew Thomas,
I think, what you want is a pref. The idea behind the toggle-menuitem/-icon is
exactly the per-message basis.
Comment 76 Matthew Thomas, usability weenie 1999-11-08 16:02:59 PST
When replying, Mozilla shouldn't be converting the converted text->HTML back to
some semblance of the original text; it should be using the exact text from the
message source. Wouldn't that avoid a whole lot of hassle with data loss or
corruption? The text->HTML conversion should be used only for display purposes,
IMO.

4. I don't think there's such a thing as a `widely used Internet rule' for
quoting. Sure, make >ed text smaller, italic, green, or whatever. But 3 + 2
> 4 ... which is why you should leave the > symbol there, because it *might* be
being used for something other than quoting, as I just showed in that equation.
Just like the *asterisks* or the /slashes/ *might* be being used for something
other than emphasis.

And yes, Ben, I do want a pref. And I do want it in the View menu, not hidden
away in the prefs dialog, for the same reason rot13 belongs in the View menu and
not in the prefs dialog -- because it's something you generally need instant
access to.
Comment 77 Ben Bucksch (:BenB) 1999-11-09 20:25:59 PST
Akk, "(bla :))"-problem:
there was a discussion on alt.ascii-art, thread "Smiley-face query"
<URL:http://www.deja.com/viewthread.xp?search=thread&recnum=%3c68j7dn$ccq@tron.sci.fi%3e%231/1>
about this.
One quote: "In any case, it'd say we can all agree on the following points:
[...] Emoticons cannot close a parenthesis."
(<URL:http://x21.deja.com/getdoc.xp?AN=312573064>).
Comment 78 Ben Bucksch (:BenB) 1999-11-12 14:28:59 PST
I've changed the smily detection code (in my tree) to avoid problems with ":-))"
etc.
It searches for the following pattern: SPACE - Smily [- [.|,|;]] - WHITESPACE.
"WHITESPACE" means nsString::IsSpace returns true, "[- [.|,|;]]" means that
optionally either ".", "," or ";" may appear after the smily (and stay in the
msg). Everything else is ignored.
Any objections?
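
A rough illustration of the pattern above, written as a Python regular expression rather than the C++ in the actual patch. The smiley list is an assumption pieced together from examples in this bug, the helper name find_smileys is hypothetical, and treating end-of-text like trailing whitespace is also an assumption of this sketch.

```python
import re

# Assumed smiley set; the real list lives in the patch and may differ.
SMILEYS = [":-)", ":-(", ";-)", ";-P", ":)", ":("]

# SPACE, then a smiley, then optionally one of ". , ;", then whitespace
# (or end of text, an assumption of this sketch). Lookarounds keep the
# surrounding characters out of the match so they stay in the message.
_SMILEY_RE = re.compile(
    r"(?<= )("
    + "|".join(re.escape(s) for s in SMILEYS)
    + r")(?=[.,;]?(?:\s|$))"
)


def find_smileys(text):
    """Return (offset, smiley) pairs for every smiley the pattern accepts."""
    return [(m.start(), m.group(1)) for m in _SMILEY_RE.finditer(text)]
```

With this pattern, " :-)) " and "(like this:-)" are left alone, while "Some text (with a little joke :-) and some more text" still matches, so the parenthesis-closing idiom remains unsolved here.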
Comment 79 Ben Bucksch (:BenB) 1999-11-12 15:39:59 PST
The latter was a bit ambiguous. My most recent changes avoid problems with
*smilies* like " :-)) ", " :-(( " etc., not the "Bla (bla :-) bla." (instead of
"Bla (bla :-)) bla.") problem.

I created bug #18718 and a dependency for the "graphical smily etc. in reply"
problem.
Comment 80 Ben Bucksch (:BenB) 1999-11-12 23:37:59 PST
Created attachment 2843 [details] [diff] [review]
[Preliminary] GlyphSubstitution, code, quote, class attributes
Comment 81 Ben Bucksch (:BenB) 1999-11-14 20:45:59 PST
Thanks to Daniel Bratell for pointing me to the (original) post.

>> Warren Harris wrote:
>>> We probably need an extensible way (based on protocols modules) to
>>> recognize strings as URLs. I don't know
>>> whether that needs to be a special method, or whether your code can just
>>> look for "<alphanumeric>*:" up to the
>>> next whitespace character, and then just try to construct a URL from it.
>>> If it succeeds, then highlight it, if not, don't.
Warren Harris wrote:
> Ben Bucksch wrote:
> > "<alphanumeric>*:<non-whitespace>*" triggers far too often.
> > Users already complain, that "file://" urls are turned into links.
> >
> > Do you have an idea, how to make this dynamic?
>
> Yes, I think you/they should look for the pattern suggested above, and then
> try calling NS_NewURI. This should only succeed if the protocol exists, and
> the string is a syntactically valid URL.
>
> I guess we should special-case file: since that's almost always a link to the
> sender's machine, and not the receiver's. Alternatively, we could call the
> nsIFileChannel::Exists method and only highlight the URL if it's there.

I like this idea. Since the main purpose of nsMimeURLUtils is this
recognition, I started rewriting the class.
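
To make the suggestion concrete, here is a small Python sketch of the two-step idea: find "<alphanumeric>*:<non-whitespace>*" candidates, then keep only those whose scheme is known. The KNOWN_SCHEMES set is a hypothetical stand-in for "NS_NewURI succeeds because a protocol handler exists"; it does not call or model the real Mozilla API, and find_links is an invented name.

```python
import re

# Hypothetical registry of supported protocols (stand-in for the check
# that a protocol handler exists for the scheme).
KNOWN_SCHEMES = {"http", "https", "ftp", "mailto", "news"}

# Step 1: cheap candidate pattern, "<alphanumeric>*:<non-whitespace>*".
_CANDIDATE_RE = re.compile(r"[A-Za-z0-9]+:\S+")


def find_links(text, known_schemes=KNOWN_SCHEMES):
    """Return the candidates that should be turned into links."""
    links = []
    for match in _CANDIDATE_RE.finditer(text):
        candidate = match.group(0)
        scheme = candidate.split(":", 1)[0].lower()
        # Special case: file: almost always points at the sender's machine.
        if scheme == "file":
            continue
        # Step 2: accept only schemes we know how to handle.
        if scheme in known_schemes:
            links.append(candidate)
    return links
```

Because the registry is just a set, a new protocol module only has to add its scheme to make its URLs linkable. Trailing punctuation is not trimmed in this sketch.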
Comment 82 Ben Bucksch (:BenB) 1999-11-30 01:30:59 PST
WOW: txt2html.pl <http://www.thehouse.org/txt2html/>
Comment 83 chris hofmann 1999-12-07 22:42:59 PST
m14.  let me know if there are more changes ready for this
in the next couple of days and we can see about getting them into m12.

maybe this is even post beta1?
Comment 84 Ben Bucksch (:BenB) 1999-12-08 01:54:59 PST
chofmann,
it has been checked in together with bug #19251 recently.
M12 FIXED.
Comment 85 Ben Bucksch (:BenB) 2000-02-15 00:32:18 PST
Docs at <http://www.bucksch.org/1/projects/mozilla/>
Comment 86 lchiang 2000-02-29 14:39:57 PST
I think it's safe to mark this verified.
The code is there.
Any specific bugs we find will be filed separately.
asj@ipa.net has graciously offered to test this feature.  He has started writing 
tests at: http://www.mozilla.org/quality/mailnews/tests/mn-html-to-txt.txt
Comment 87 [:Aureliano Buendía] 2009-10-19 02:34:41 PDT
*** Bug 522893 has been marked as a duplicate of this bug. ***
Comment 88 Doug Hockin 2009-10-21 09:19:15 PDT
My Bug 522893 was marked as a duplicate. In Bug 522893 it was recommended that I try out:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-1.9.1/

Which I just did. Upon install it "imported" my existing mail folders. The emails that had the problem in TBird 2 still have it in 3. I'll attach a screen shot from one of them. Doesn't appear fixed to me. Unless it somehow has to do with message storage format on disk and I need to test with freshly received messages?
Comment 89 Doug Hockin 2009-10-21 09:22:00 PDT
Created attachment 407543 [details]
Still truncates URLs in old saved messages

Screen shot of old saved (TBird 2) mail message that still has truncated URL when viewed in Tbird 3.
Comment 90 Ben Bucksch (:BenB) 2009-10-21 09:30:30 PDT
This bug is FIXED. Bug 522893 is not a duplicate, I'll reopen the latter.
