Closed
Bug 16800
Opened 25 years ago
Closed 25 years ago
Improve HTML -> plain text
Categories
(MailNews Core :: Composition, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: BenB, Assigned: BenB)
Attachments (5 files)
25.67 KB, text/plain
8.20 KB, text/plain
717 bytes, patch
8.14 KB, patch
6.11 KB, patch
Convert HTML's <strong>, <em> and <a href="..."> to *bold*, _italic_ and <URL:...>. This bug was split off from bug #16507; see the description and early comments there.
_underline_, /italics/, *bold*. See http://www.elsewhere.org/jargon_search/SEC13.html, near the bottom.
Comment 2 • 25 years ago
The best place to add these would be in nsHTMLToTXTSinkStream::OpenContainer() and CloseContainer(). See what it currently does for list items (eHTMLTag_li): you can call Write() with one character on opening the tag, and again (same or different character) on closing the tag. <b> and <i> should also work, since a lot of people use those tags instead of <strong> and <em>. You can assign this bug to me if you want, and I'll do it when I find time; adding the text types shouldn't take any time (I can do that in about 15 minutes plus testing); the URLs are a bit harder, as I tried to do that once before and ran into a snag of some sort. I'd like to argue about the URL format, though. What I was thinking about doing for URLs was to convert <a href="foo">bar</a> into bar (foo). If it converted into URL:foo, then I'd no longer be able to use double-click to select it in a standard plaintext window, which would be a major pain (I do that many times a day) -- I'd have to target the text after the colon, then carefully drag to the end of the word, instead of just targeting any part of the word and double-clicking.
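To illustrate the open/close idea described above, here is a minimal sketch in Python rather than in the actual sink code. It assumes the markers from the earlier comment (*bold*, /italics/, _underline_) and the "bar (foo)" link form proposed here; the names `MARKERS`, `PlainTextSink` and `to_plain_text` are hypothetical and are not part of nsHTMLToTXTSinkStream.

```python
from html.parser import HTMLParser

# Hypothetical marker table: tag -> character written on open and on close.
MARKERS = {"strong": "*", "b": "*", "em": "/", "i": "/", "u": "_"}


class PlainTextSink(HTMLParser):
    """Toy converter: writes one marker when a container opens and one when it closes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.hrefs = []  # stack of link targets for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in MARKERS:
            self.out.append(MARKERS[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in MARKERS:
            self.out.append(MARKERS[tag])
        elif tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                self.out.append(f" ({href})")  # the "bar (foo)" form

    def handle_data(self, data):
        self.out.append(data)


def to_plain_text(html):
    """Convert a small HTML fragment to marked-up plain text."""
    sink = PlainTextSink()
    sink.feed(html)
    sink.close()
    return "".join(sink.out)
```

Because <b> and <strong> share one table entry (as do <i> and <em>), both spellings produce the same plain-text marker.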
Comment 3 • 25 years ago (Assignee)
Akk, there's some discussion about the form going on in n.p.m.mail-news, in the thread "Assuming "plain text" or "html mail"..", starting at news://news.mozilla.org/38054F6C.8624442@bucksch.org. Brenden pointed to RFC1738 ftp://venera.isi.edu/in-notes/rfc1738.txt, Appendix (Page 21), where <URL:...> is recommended.
Comment 4 • 25 years ago
Reading rfc1738, it does not seem to be absolutely required that no space follow the "<URL:" in <URL:foo://bar>. While the string "<URL:" is never followed by anything other than the first letter of the URL scheme anywhere in the RFC text, the only part of the RFC that describes the use of <URL:foo://bar>, the Appendix, does not explicitly rule out the use of <URL: foo://bar> in the same context. Here is the relevant passage: "In addition, there are many occasions when URLs are included in other kinds of text; examples include electronic mail, USENET news messages, or printed on paper. In such cases, it is convenient to have a separate syntactic wrapper that delimits the URL and separates it from the rest of the text, and in particular from punctuation marks that might be mistaken for part of the URL. For this purpose, is recommended [sic] that angle brackets "<" and ">"), along with the prefix "URL:", be used to delimit the boundaries of the URL. This wrapper does not form part of the URL and should not be used in contexts in which delimiters are already specified." From this context it seems clear that this syntax is wanted for human parsing, not machine parsing. The use of spaces inside the angle brackets is neither encouraged nor disallowed. On this basis, would it be possible to have our cake and follow the RFC too by using <URL: foo://bar> instead of <URL:foo://bar>? This would address the concerns akkana@netscape.com raised about being able to click on URLs in plain-text (especially useful for e-mail).
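For a reader on the receiving end, tolerating both spellings is cheap. The following is a rough sketch, assuming a simple regular expression is good enough to find the wrapper; `WRAPPED_URL` and `extract_wrapped_urls` are hypothetical names, and the pattern is not taken from the RFC or from any mailer.

```python
import re

# Matches <URL:foo://bar> as well as the spaced variant <URL: foo://bar>.
# The URL itself may not contain whitespace or angle brackets.
WRAPPED_URL = re.compile(r"<URL:\s*([^\s<>]+)\s*>")


def extract_wrapped_urls(text):
    """Return the URLs found inside <URL:...> wrappers, without the wrapper."""
    return WRAPPED_URL.findall(text)
```

The wrapper is dropped and only the URL is returned, which matches the RFC's statement that the wrapper does not form part of the URL.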
Comment 5 • 25 years ago (Assignee)
sidr, please post this to n.p.m.mailnews to the thread I've mentioned.
Comment 6 • 25 years ago
Trying every Windows mailer immediately available to me on NT (Messenger, Eudora, Outlook Express), all seem to make working links out of <URL:foo://bar>, so the objection akkana@netscape.com raises seems to be theoretical. My response proposing <URL: foo://bar> (with the space after "URL:") should be treated likewise.
Comment 7 • 25 years ago
Not theoretical, but something I do many times a day. You can set up xterms so that colons are considered part of words, but you can't set them up so that http: is part of a word but URL: is not. What's the benefit of the URL: part? I've seen it here and there, but I've never seen an application that requires it or does anything useful with it. Most applications I've seen already recognize http:// as a URL and don't need any further prefix.
Comment 8 • 25 years ago
Ok, it's not theoretical. I did ask in n.p.m.mailnews if anyone had experience with X MUAs, but I'm sure both messages passed each other in the mails :-) . In that case, would the "adapted" rfc1738 "compliant" <URL: foo://bar> work, or would the ">" at the end cause problems too? I suppose <URL: foo://bar > would fix that, but that is starting to look ugly. The point, as explained by TimBL et al in the appendix of rfc1738, is to make it easier for a human to tell the boundaries of a URL in plain-text writing. An example that might be seen in a mail message: "Have you tried <URL:http://www.foobar.nosuchtld/something.cgi>?" - without the angle-bracket delimiters, the "?" could look like part of the URL. Other punctuation is just as bad, it's easy to miss while copy-n-pasting. The other area where it helps is in URLs that get broken over lines.
Comment 9 • 25 years ago
The > doesn't cause a problem; since it isn't a legal character in a URL, it's easy to filter that out of the "word" definition. Personally, I think URL: is ugly and don't see why that part is necessary (I do understand wanting the angle brackets, for the reasons you gave), but if it were separated from the actual URL by a space, I wouldn't have a functional objection to it, just the aesthetic one.
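The double-click problem can be simulated with a small sketch. It assumes a terminal-style "word" is a run of characters taken from a configurable set; the sets `BASE` and `WITH_COLON`, the function `select_word` and the example.org address are all made up for illustration and do not reflect how any particular xterm is configured.

```python
import string

# Hypothetical word-character sets; real terminals are configured differently.
BASE = set(string.ascii_letters + string.digits + "/.-_~%?&=#")
WITH_COLON = BASE | {":"}


def select_word(text, index, word_chars):
    """Return the run of word characters around text[index], as a double-click would."""
    if text[index] not in word_chars:
        return ""
    start = index
    while start > 0 and text[start - 1] in word_chars:
        start -= 1
    end = index
    while end < len(text) and text[end] in word_chars:
        end += 1
    return text[start:end]
```

With the colon counted as a word character, clicking inside `<URL:http://example.org/a>` picks up the `URL:` prefix along with the address, while the spaced form `<URL: http://example.org/a>` yields the address alone. The angle brackets are outside both sets, so they never end up in the selection.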
Comment 10 • 25 years ago (Assignee)
Akk, I don't want to be ignorant, but if we don't follow the RFC recommendation and create our own scheme (one more suggestion: <foo://bar>), I want to be sure we have a real reason. I still don't understand your problem. Do you read your mail using a text-based mailer? How often (applications, users) does the problem appear?
Comment 11 • 25 years ago
*** Bug 16958 has been marked as a duplicate of this bug. ***
Updated • 25 years ago (Assignee)
Status: NEW → ASSIGNED
Comment 12 • 25 years ago (Assignee)
*** Bug 12969 has been marked as a duplicate of this bug. ***
Comment 13 • 25 years ago (Assignee)
*** Bug 12969 has been marked as a duplicate of this bug. ***
Comment 14 • 25 years ago
This HTML->plain text conversion takes place not only when I send an HTML message to a plain-text recipient, but also when I copy some text from a Web page and paste it into a plain-text message, or into Notepad, or wherever, right?
Comment 15 • 25 years ago
Yes and no: yes, it goes through the same output converters, but no, it goes through with different flags (unformatted for copy/paste rather than formatted for mail) so this sort of conversion won't be done on copy/paste. Most people just want the raw text and nothing else when copy/pasting plain text.
Comment 16 • 25 years ago (Assignee)
Matty, could you please explain what you mean by your comment on bug #16507? "Regarding translating em and strong to plain text, I'm not really sure, but it would be really nice to use a stylesheet. That sounds quite complicated though." I don't know how we could use a stylesheet to do a conversion. As far as I know, they just apply attributes to SGML or XML tags. There are no such tags in plain text messages.
Comment 17 • 25 years ago (Assignee)
Now I'm completely confused :-). We have HTML or XIF as input here, to which we can apply stylesheets. But I still don't know how to use that for a conversion to TXT.
Comment 18 • 25 years ago
I was referring to looking at the stylesheet, seeing that em maps to bold, and hence putting the plain text version in stars. But this assumes that stars means bold, which may or may not be the case.
Comment 19 • 25 years ago (Assignee)
Comment 20 • 25 years ago (Assignee)
Comment 21 • 25 years ago (Assignee)
The changes fix indentation, ul and ol, nested lists, and structured phrases and links, of course. Unfortunately, the diffs won't apply after the checkin of format=flowed. Nevertheless, I attached them, because the changes might be useful when fixing the bugs still remaining in the new files. The syntax of nsString usage depends on bug #17641, but can be changed easily if 17641 doesn't make it into the tree.
Updated • 25 years ago (Assignee)
Summary: HTML -> plain text: strong, em, a → Improve HTML -> plain text
Comment 22 • 25 years ago (Assignee)
I'll try to get lists working, too.
Comment 23 • 25 years ago (Assignee)
Comment 24 • 25 years ago (Assignee)
Comment 25 • 25 years ago (Assignee)
The patches are for version 3.19. Apart from support for a, img, em, i, strong, b and sup (I didn't know a plaintext equivalent for sub), they implement indentation for lists and nested lists, and improve the list bullets/numbers. The latter still has a minor problem (too many spaces); I'll send a new version of the patches when 17883 is fixed.
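As a rough picture of what indented, nested lists with bullets and numbers could look like, here is a small sketch. It is not the patch itself: the function `render_list`, its tuple-based item format and the `indent` parameter are assumptions made for illustration.

```python
def render_list(items, ordered=False, indent=4, depth=0):
    """Render a nested list as plain-text lines.

    Each item is either a string or a (text, children, children_ordered) tuple.
    """
    lines = []
    for number, item in enumerate(items, start=1):
        if isinstance(item, tuple):
            text, children, children_ordered = item
        else:
            text, children, children_ordered = item, [], False
        # Numbered marker for ordered lists, a star for unordered ones.
        marker = f"{number}." if ordered else "*"
        lines.append(" " * (indent * depth) + f"{marker} {text}")
        # Children are rendered one level deeper, each with its own list type.
        lines.extend(render_list(children, children_ordered, indent, depth + 1))
    return lines
```

Calling `render_list(["a", ("b", ["c", "d"], True)], indent=2)` gives two bulleted lines followed by two numbered lines indented by two spaces. Keeping the indent width as a parameter leaves room for making it a user preference.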
Comment 26 • 25 years ago
The patch looks good. We obviously need to make the indentation configurable; I made it 2 because that's the mozilla standard for code (not that that has anything to do with plaintext output); 4 would be my personal preference, 8 seems too much, but we should make it a preference so everyone can decide. I can add that later. On writing out the title in SetTitle: I don't think that's the right thing to do in plaintext, because it isn't something that's shown to the user when looking at the html page, and the goal of formatted plaintext output is to make the plaintext look as much as possible as the html looked to the user. So my inclination is to remove that part unless you can make a case for why we should be outputting the title. I still hate the URL: without a space since it means I can't select words by double-clicking, but I'll try to live with it for a while and see how much of an annoyance it turns out to be. Why did you remove the stripping of quotes from value.StripChars("\"").Equals("cite", PR_TRUE)? I had cases where it didn't work due to that (you can't always count on the quotes being there or not), so I think this needs to be there unless you know of some change that's happened recently that will guarantee quotes not being in the attribute string (which would be nice). Otherwise, it looks good, and we should take this whenever the tree opens for checkins.
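The quote-stripping comparison can be pictured with a short sketch. It assumes that the PR_TRUE argument in the quoted call asks for a case-insensitive comparison, which the comment does not state; `attr_equals` is a hypothetical helper, not a Mozilla function.

```python
def attr_equals(value, expected):
    """Compare an attribute value with any double quotes removed, ignoring case."""
    return value.replace('"', "").lower() == expected.lower()
```

Written this way, the check gives the same answer whether or not the parser left the quotes in the attribute string.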
Comment 27 • 25 years ago (Assignee)
Akk,

The indentation is a leftover from my testing. My preference would be 4, too. I agree that this should be configurable. I'll try to build it into my final patch.

SetTitle: Titles carry *very* important information that isn't repeated in the body in many of my HTML files. I do not agree that plaintext should look like browser output. Not only is HTML not about the look, plaintext also has its own way of representing meaning. What I think we should do is take the information the HTML file carries and produce a plaintext document that looks the way it would if the document had been created directly in plaintext. Of course, it doesn't make sense to print the title in this form in an email. I forgot to add a test for emptiness, in which case the title isn't written. Can we figure out a system that works for both (are there more?) cases? Maybe defaulting the title to zero length?

URL: You haven't answered my questions (see above) yet, so I had to decide. It's extremely simple to change (just search for occurrences of "URL:", note the plural).

StripChars: I don't remember how I came to the conclusion that this isn't needed. I think I made a test with a node with quotes and |value| didn't contain them. If you have problems, leave it in.
Comment 28 • 25 years ago (Assignee)
Status update: I'll see what I can do for tables now.
Comment 29 • 25 years ago (Assignee)
Status update: I'll see what I can do for tables now. (My mail bounces at the moment :-((. Until it's fixed, send mail to ben.bucksch@uumail.de.)
Comment 30 • 25 years ago
I think the presence or absence of quotes is affected by whether we're parsing from html or from XIF. The answer to the question about URL: without the space is in my comment about xterms -- it applies to any app (including a mail program) which runs in an xterm. I have a patch which integrates Ben's work, a quoting fix from Daniel, and some patches of my own into one, which I'm attaching to this bug.
Comment 31 • 25 years ago
Comment 32 • 25 years ago
Oh, forgot to address the SetTitle issue: I was under the impression that most web pages (certainly mine are like this) basically repeat the title information in an H1 or similar at the top of the page. So if we output the title, the output will show two identical or nearly identical lines, one of which wasn't visible at all to the person viewing the html page. Title is meta-information which, in my opinion, doesn't belong in the output. Any lurkers on this bug want to add their opinions?
Comment 33 • 25 years ago (Assignee)
For everybody who doesn't want to analyse the patch: Akk's changes include a space after "URL:", ifdef'ing SetTitle, reincluding StripChars, 2 fixes for "VerticalSpace" and added support for <hr>. The latter should be changed. Using the wrap width should be easy to do (she expressed her intention to do so), and just a "Write" will do the wrong thing, from my understanding of Daniel's new functions. You should make sure the hr is on its own line.
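The two requests for <hr> (use the wrap width, and keep the rule on its own line) might look roughly like this. The sketch works on a plain string instead of the sink's Write functions; `render_hr`, the default width of 72 and the fill character are assumptions.

```python
def render_hr(output, wrap_width=72, fill="-"):
    """Append a horizontal rule on its own line and return the new text."""
    if output and not output.endswith("\n"):
        output += "\n"  # finish the current line before starting the rule
    return output + fill * wrap_width + "\n"
```

The rule is as wide as the wrap width, and a line break is added first only when the preceding text has not already ended its line.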
Comment 34 • 25 years ago (Assignee)
Akk, it would be nice if we could move the discussion about opinions to the newsgroups; they provide better structure for that (e.g. threading). This page is already long enough. I suggest mail-news.
Comment 35 • 25 years ago (Assignee)
Opened bug #18012 to avoid a superbug.
Comment 36 • 25 years ago (Assignee)
Shaver seems to have been right. RFC1855 <URL:ftp://venera.isi.edu/in-notes/rfc1855.txt>: "Use underscores for underlining."
Updated • 25 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Updated • 24 years ago (Assignee)
Severity: normal → enhancement
Updated • 20 years ago
Product: MailNews → Core
Updated • 16 years ago
Product: Core → MailNews Core